1 EE/CPRE 465 VLSI Design Process
2 Outline Design Partitioning Design process: MIPS Processor as an example –Architecture Design –Microarchitecture Design –Logic Design –Circuit Design –Physical Design Fabrication, Packaging, Testing
3 Coping with Complexity How to design System-on-Chip? –Many millions (even billions!) of transistors –Tens to hundreds of engineers Structured Design Partitioning of Design Process
4 Structured Design Hierarchy: Divide and Conquer –Recursively partition a system into modules Regularity –Reuse modules wherever possible –Example: Uniformly sized transistors at circuit level Standard cell library at gate level Modularity: well-formed interfaces –Allows modules to be treated as black boxes Locality –Physical and temporal
5 Partitioning of Design Process Architecture Design: User’s perspective, what does it do? –Instruction set, register set, and memory model –MIPS, x86, PIC, ARM, Power, SPARC, Alpha,… Microarchitecture Design: how the architecture is partitioned into registers and functional units –Single cycle, multcycle, pipelined, superscalar? –For x86: 386, 486, Pentium, PII, PIII, P4, Core, Core 2, Atom, Celeron, Cyrix MII, AMD K5, Athlon, Phenom Logic Design: how are functional blocks constructed –Ripple carry, carry lookahead, carry select adders Circuit Design: how are transistors used to implement the logic –Complementary CMOS, pass transistors, domino Physical Design: chip layout
Two Types of Engineers “Short and fat” engineers –Understand a large amount about a narrow field “Tall and skinny” engineers –Understand something about a broad range of topics Digital VLSI design favors the tall and skinny engineer –can evaluate how choices in one part of the system impact other parts of the system 6
7 MIPS Architecture Example: subset of MIPS processor architecture –Drawn from Patterson & Hennessy MIPS is a 32-bit architecture with 32 registers –Consider 8-bit subset using 8-bit datapath –Only implement 8 registers ($0 - $7) –$0 hardwired to –8-bit program counter Original MIPS Architecture Simplified MIPS Architecture here Data width32 bits8 bits Address width32 bits8 bits # of registers328 Instruction length32 bits
8 Instruction Set imm x
9 Instruction Encoding 32-bit instruction encoding –Requires four cycles to fetch on 8-bit datapath Note that the destination register is specified by: –Bits 15:11 for R-type instructions –Bits 20:16 for addi instruction
10 Fibonacci (C) f 0 = 1; f -1 = -1 f n = f n-1 + f n-2 f 1 =0, f 2 =1, f 3 =1, f 4 =2, f 5 =3,...
11 Fibonacci (Assembly)
12 Fibonacci (Binary) Machine language program
13 Multicycle MIPS Microarchitecture Shift left 2
14 Multicycle MIPS µ-arch (32-bit Design)
15 Multicycle Controller
16 Chapter 5 of Patterson and Hennessy (32-bit Design) Summary of Steps for Each Instruction Class Step name Action for R-type Instruction Action for load Instruction Action for store Instruction Action for branch Instruction Action for jump Instruction Instruction fetch IR <= Memory[PC] PC <= PC + 4 Instruction decode / register fetch A <= Reg[IR[25:21]] B <= Reg[IR[20:16]] ALUOut <= PC + (sign-extend(IR[15:0]) << 2) Execution / address computation / branch/jump completion ALUOut <= A op B ALUOut <= A + sign-extend(IR[15:0]) If (A==B) PC <= ALUOut PC <= {PC[31:28], IR[25:0], 2’b00} Memory access / R-type completion Reg[IR[15:11]] <= ALUOut MDR <= Memory[ALUOut] <= B Memory read completion Reg[IR[20:16]] <= MDR Become 4 steps in our 8-bit design
17 Chapter 5 of Patterson and Hennessy (32-bit Design) Instructions from ISA Perspective Consider each instruction from the perspective of ISA. Example: Add instruction –Instruction specified by the PC. –Operand registers are specified by bits 25:21 and 20:16 of the instruction –New value is the sum of two registers. –Register written is specified by bits 15:11 of instruction. Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]] PC <= PC + 4 In order to accomplish this we must break up the instruction. –kind of like introducing variables when programming ISA: Instruction Set Architecture
18 Chapter 5 of Patterson and Hennessy (32-bit Design) Breaking Down an Instruction ISA definition of arithmetic: Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]] Could break down to: –IR <= Memory[PC] –A <= Reg[IR[25:21]] –B <= Reg[IR[20:16]] –ALUOut <= A + B –Reg[IR[15:11]] <= ALUOut Don’t forgot an important part of the definition of arithmetic! –PC <= PC + 4
19 Chapter 5 of Patterson and Hennessy (32-bit Design) Idea Behind Multicycle Approach We define each instruction from the ISA perspective Break it down into steps: –Balance the amount of work to be done in different steps –Restrict each cycle to use only one major functional unit Introduce new registers as needed –A, B, ALUOut, MDR, IR, etc. Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution) Result: Our book’s multicycle implementation!
20 Chapter 5 of Patterson and Hennessy (32-bit Design) 1.Instruction Fetch 2.Instruction Decode and Register Fetch 3.Execution, Memory Address Computation, or Branch / Jump Completion 4.Memory Access or R-type Instruction Completion 5.Memory Read Completion INSTRUCTIONS TAKE FROM CYCLES! Five Execution Steps 6-8 cycles in our 8-bit design Become 4 steps since we have an 8-bit design
21 Chapter 5 of Patterson and Hennessy (32-bit Design) Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC. Can be described succinctly using "Register-Transfer Language“ (RTL): IR <= Memory[PC]; PC <= PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now? Step 1: Instruction Fetch Become 4 steps in our 8-bit design
22 Chapter 5 of Patterson and Hennessy (32-bit Design) Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTL: A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(IR[15:0]) << 2); We are not setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) Step 2: Instruction Decode and Register Fetch
23 Chapter 5 of Patterson and Hennessy (32-bit Design) ALU is performing one of three functions, based on instruction type R-type: ALUOut <= A op B; Memory Reference: ALUOut <= A + sign-extend(IR[15:0]); Branch: if (A==B) PC <= ALUOut; Jump : PC <= {PC[31:28], IR[25:0], 2’b00} Step 3: Execution, Memory Address Computation, or Branch / Jump Completion (Instruction Dependent)
24 Chapter 5 of Patterson and Hennessy (32-bit Design) Loads and stores access memory MDR <= Memory[ALUOut]; or Memory[ALUOut] <= B; R-type instructions completion Reg[IR[15:11]] <= ALUOut; Step 4: Memory Access or R-type Instruction Completion
25 Chapter 5 of Patterson and Hennessy (32-bit Design) Reg[IR[20:16]] <= MDR; Step 5: Memory Read Completion
26 Logic Design Start at top level –Hierarchically decompose MIPS into units Top-level interface
27 Block Diagram
28 Hierarchical Design
29 HDLs Hardware Description Languages –Widely used in logic design –Verilog and VHDL Describe hardware using code –Document logic functions –Simulate logic before building –Synthesize code into gates and layout Requires a library of standard cells
30 Verilog Example module adder(input logic [7:0] a, b, input logic c, output logic [7:0] s, output logic cout); wire [6:0] carry; fulladder fa0(a[0], b[0], c, s[0], carry[0]); fulladder fa0(a[1], b[1], carry[0], s[1], carry[1]); fulladder fa0(a[2], b[2], carry[1], s[2], carry[2]);.... fulladder fa0(a[7], b[7], carry[6], s[7], cout); endmodule module fulladder(input logic a, b, c, output logic s, cout); sums1(a, b, c, s); carryc1(a, b, c, cout); endmodule module carry(input logic a, b, c, output logic cout) assign cout = (a&b) | (a&c) | (b&c); endmodule
31 Circuit Design How should logic be implemented? –NANDs and NORs vs. ANDs and ORs? –Fan-in and fan-out? –How wide should transistors be? These choices affect speed, area, power Logic synthesis makes these choices for you –Good enough for many applications –Hand-crafted circuits are still better
32 Example: Carry Logic assign cout = (a&b) | (a&c) | (b&c); Gate-level design: 26 transistors, 4 stages of gate delays
33 Example: Carry Logic assign cout = (a&b) | (a&c) | (b&c); Transistor-level design: 12 transistors, 2 stages of gate delays
34 Gate-level Netlist module carry(input a, b, c, output cout) wire x, y, z; and g1(x, a, b); and g2(y, a, c); and g3(z, b, c); or g4(cout, x, y, z); endmodule
35 Transistor-Level Netlist module carry(input a, b, c, output cout) wire i1, i2, i3, i4, cn; tranif1 n1(i1, 0, a); tranif1 n2(i1, 0, b); tranif1 n3(cn, i1, c); tranif1 n4(i2, 0, b); tranif1 n5(cn, i2, a); tranif0 p1(i3, 1, a); tranif0 p2(i3, 1, b); tranif0 p3(cn, i3, c); tranif0 p4(i4, 1, b); tranif0 p5(cn, i4, a); tranif1 n6(cout, 0, cn); tranif0 p6(cout, 1, cn); endmodule
36 SPICE Netlist.SUBCKT CARRY A B C COUT VDD GND MN1 I1 A GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P MN2 I1 B GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P MN3 CN C I1 GND NMOS W=1U L=0.18U AD=0.5P AS=0.5P MN4 I2 B GND GND NMOS W=1U L=0.18U AD=0.15P AS=0.5P MN5 CN A I2 GND NMOS W=1U L=0.18U AD=0.5P AS=0.15P MP1 I3 A VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1 P MP2 I3 B VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1P MP3 CN C I3 VDD PMOS W=2U L=0.18U AD=1P AS=1P MP4 I4 B VDD VDD PMOS W=2U L=0.18U AD=0.3P AS=1P MP5 CN A I4 VDD PMOS W=2U L=0.18U AD=1P AS=0.3P MN6 COUT CN GND GND NMOS W=2U L=0.18U AD=1P AS=1P MP6 COUT CN VDD VDD PMOS W=4U L=0.18U AD=2P AS=2P CI1 I1 GND 2FF CI3 I3 GND 3FF CA A GND 4FF CB B GND 4FF CC C GND 2FF CCN CN GND 4FF CCOUT COUT GND 2FF.ENDS
37 Physical Design Floorplan –Area estimation Place & route –Standard cells Datapaths –Slice planning
38 Synthesized MIPS Layout
39 MIPS Floorplan
40 Area Estimation Need area estimates to make floorplan –Compare to another block you already designed –Or estimate from transistor counts –Budget room for large wiring tracks –Your mileage may vary!
41 MIPS Layout
42 Standard Cells Uniform cell height Uniform well height M1 V DD and GND rails M2 Access to I/Os Well / substrate taps Exploits regularity
43 Synthesized Controller Synthesize HDL into gate-level netlist Place & Route using standard cell library
44 Snap-Together Cells Synthesized controller area is mostly wires –Design is smaller if wires run through/over cells –Smaller = faster, lower power as well! Design snap-together cells for datapaths and arrays –Plan wires into cells –Pitch Matching required –Connect by abutment Exploits locality Takes lots of effort
45 MIPS Datapath 8-bit datapath built from 8 bitslices (regularity) Zipper at top drives control signals to datapath
46 MIPS ALU Arithmetic / Logic Unit is part of bitslice
47 Slice Plans Slice plan for bitslice –Cell ordering, dimensions, wiring tracks –Arrange cells for wiring locality
48 Design Verification Fabrication is slow & expensive –MOSIS 0.6 m: $1000, 3 months –65 nm: $3M, 1 month Debugging chips is very hard –Limited visibility into operation Prove design is right before building! –Logic simulation –Ckt. simulation / Formal verification –Layout vs. schematic (LVS) comparison –Design & electrical rule checks (DRC & ERC) Verification is > 50% of effort on most chips!
49 Fabrication & Packaging Tapeout final layout Fabrication –6, 8, 12” wafers –Optimized for throughput, not latency (10 weeks!) –Cut into individual dice Packaging –Bond gold wires from die I/O pads to package
50 Testing Test that chip operates –Design errors –Manufacturing errors A single dust particle or wafer defect kills a die –Yields from 90% to < 10% –Depends on die size, maturity of process –Test each part before shipping to customer