Download presentation
Presentation is loading. Please wait.
Published byHarriet Greene Modified over 6 years ago
1
CS184a: Computer Architecture (Structure and Organization)
Day 9: January 26, 2005 Modeling Instruction Space and Empirical Comparisons Caltech CS184 Winter DeHon
2
Last Time Instruction Requirements Instruction Space
Caltech CS184 Winter DeHon
3
Architecture Instruction Taxonomy
Caltech CS184 Winter DeHon
4
Today Instructions Empirical Data Model Architecture Processors FPGAs
implied costs gross application characteristics Empirical Data Processors FPGAs Custom Gate Array Std. Cell Full Caltech CS184 Winter DeHon
5
Quotes If it can’t be expressed in figures, it is not science; it is opinion Lazarus Long Caltech CS184 Winter DeHon
6
Modeling Why do we model? Caltech CS184 Winter DeHon
7
Motivation Need to understand How costly (big) is a solution
How compare to alternatives Cost and benefit of flexibility Caltech CS184 Winter DeHon
8
What we really want: Complete implementation of our application
For each architectural alternatives In same implementation technology w/ multiple area-time points Caltech CS184 Winter DeHon
9
Reality Seldom get it packaged that nicely Deal with
much work to do so technology keeps moving Deal with estimation from components technology differences few area-time points Caltech CS184 Winter DeHon
10
Modeling Instruction Effects
Restrictions from “ideal” save area Restriction from “ideal” limits usability (yield) of PE Want to understand effects area model utilization/yield model Caltech CS184 Winter DeHon
11
Efficiency/Yield Intuition
What happens when Datapath is too wide? Datapath is too narrow? Instruction memory is too deep? Instruction memory is too shallow? Caltech CS184 Winter DeHon
12
Computing Device Composition Bit Processing elements
Interconnect: space Interconnect: time Instruction Memory Tile together to build device Caltech CS184 Winter DeHon
13
Relative Sizes Bit Operator 10-20Kl2
Bit Operator Interconnect K-1Ml2 Instruction (w/ interconnect) Kl2 Memory bit (SRAM) Kl2 Shown here are the ballpark sizes for each of the major components in these devices. Here, and elsewhere, I’ll be talking about feature sizes in normalized units of lambda, where lambda is half the minimum feature size in a VLSI process. This way I can talk about areas without tying my general results to any particular process. Lambda-squared, then is a normalized measure of area. These numbers are come from looking at the composition in VLSI layouts both constructively and decompositionally looking at existing artifacts. A bit operator, like an ALU bit or an FPGA LUT or even a gate, along with a reasonable slice of interconnect will take up 500K to 1M lambda-squared. The instruction which controls this compute and interconnect will be about one 10th to one 12th that size. A single memory bit for holding data will be fairly small, but these do add up when we collect a large number of them together. Shown here is the relative area for these components. Note that the operator itself is generally negligible in area compared to the interconnect. As we’ve noted the instruction is an order of magnitude smaller than the interconnect, but a large number of these can easily dominate the interconnect. Similarly for memory bits, they’re small by themselves but can rapidly add up to consume significant amounts of area. So, now let’s look at how these areas impact what we can build in fixed area. Caltech CS184 Winter DeHon
14
Model Area Caltech CS184 Winter DeHon
15
Calibrate Model Caltech CS184 Winter DeHon
16
Peak Densities from Model
Only 2 of 4 parameters small slice of space 100 density across Large difference in peak densities large design space! Caltech CS184 Winter DeHon
17
Efficiency What do we want to maximize? Yield Fraction / Area
Useful work per unit silicon (not potential/peak work) Yield Fraction / Area (or minimize (Area/Yield) ) Caltech CS184 Winter DeHon
18
Efficiency For comparison, look at relative efficiency to ideal.
Ideal = architecture exactly matched to application requirements Efficiency = Aideal/Aarch Aarch = Area Op/Yield Caltech CS184 Winter DeHon
19
Efficiency Calculation
Caltech CS184 Winter DeHon
20
Efficiency: Width Mismatch
16K PEs Caltech CS184 Winter DeHon
21
Path Length How many primitive-operator delays before can perform next operation? Reuse the resource Caltech CS184 Winter DeHon
22
Reuse Pipeline and reuse at primitive-operator delay level.
How many times can I reuse each primitive operator? Path Length: How much sequentialization Is allowed (required)? Caltech CS184 Winter DeHon
23
Context Depth Caltech CS184 Winter DeHon
24
Efficiency with fixed Width
Path Length Context Depth w=1, 16K PEs Caltech CS184 Winter DeHon
25
Ideal Efficiency (different model)
Two resources here: active processing elements operation description/state Applications need in different proportions. Application Requirement …make sure to define lpath Caltech CS184 Winter DeHon
26
Robust Point depend on Width
Caltech CS184 Winter DeHon
27
Processors and FPGAs c=d=1024, w=64, k=2 c=d=1, w=1, k=4 “Processor”
Caltech CS184 Winter DeHon
28
Intermediate Architecture
w=8 c=64 16K PEs Hard to be robust across entire space… Caltech CS184 Winter DeHon
29
Caveats Model abstracts away many details which are important
interconnect (day ) control (day 21) specialized functional units (next time) Applications are a heterogeneous mix of characteristics Caltech CS184 Winter DeHon
30
Modeling Message Architecture space is huge
Easy to be very inefficient Hard to pick one point robust across entire space Why we have so many architectures? Caltech CS184 Winter DeHon
31
General Message Parameterize architectures Look at continuum
costs benefits Often have competing effects leads to maxima/minima Caltech CS184 Winter DeHon
32
Admin Going forward and back in lecture slides
Handing back assignments Caltech CS184 Winter DeHon
33
Big Ideas [MSB Ideas] Applications typically have structure
Exploit this structure to reduce resource requirements Architecture is about understanding and exploiting structure and costs to reduce requirements Caltech CS184 Winter DeHon
34
Big Ideas [MSB Ideas] Instruction organization induces a design space (taxonomy) for programmable architectures Arch. structure and application requirements mismatch inefficiencies Model visualize efficiency trends Architecture space is huge can be very inefficient need to learn to navigate Caltech CS184 Winter DeHon
35
Empirical Comparisons
Caltech CS184 Winter DeHon
36
Empirical Ground modeling in some concretes Start sorting out
custom vs. configurable spatial configurable vs. temporal Caltech CS184 Winter DeHon
37
Full Custom Get to define all layers Use any geometry you like
Only rules are process design rules CS181 Caltech CS184 Winter DeHon
38
Standard Cell Area inv nand3 inv AOI4 nor3 Inv All cells uniform
height Width of channel determined by routing Cell area Identify the full custom and standard cell regions on 386DX die Caltech CS184 Winter DeHon
39
MPGA Metal Programmable Gate Array Gates pre-placed (poly, diffusion)
Only get to define metal connections Cheap – only have to pay for metal mask(s) Caltech CS184 Winter DeHon
40
MPGA vs. Custom? AMI CICC’83 Toshiba DSP Mosaid RAM GE CICC’86
Std-Cell 0.7 Custom 0.5 Toshiba DSP Custom 0.3 Mosaid RAM Custom 0.2 GE CICC’86 MPGA 1.0 Std-Cell FF/counter 0.7 FullAdder 0.4 RAM 0.2 MPGA = Metal Programmable Gate Array (traditional Gate Array) Caltech CS184 Winter DeHon
41
Metal Programmable Gate Arrays
Caltech CS184 Winter DeHon
42
MPGAs Modern -- “Sea of Gates” yield 35--70% maybe 5kl2/gate ?
(quite a bit of variance) Caltech CS184 Winter DeHon
43
FPGA Table Caltech CS184 Winter DeHon
44
Modern FPGAs 1.25Ml2/LE 1. 5Ml2/4-LUT APEX 20K1500E 52K LEs 0.18mm
24mm 22mm 1.25Ml2/LE XC2V1000 10.44mm x 9.90mm [source: Chipworks] 0.15mm 11,520 4-LUTs 1. 5Ml2/4-LUT [Both also have RAM in cited area] Caltech CS184 Winter DeHon
45
Conventional FPGA Tile
K-LUT (typical k=4) w/ optional output Flip-Flop Caltech CS184 Winter DeHon
46
Toronto FPGA Model Caltech CS184 Winter DeHon
47
How many gates? Caltech CS184 Winter DeHon
48
“gates” in 2-LUT Caltech CS184 Winter DeHon
49
Now how many? Caltech CS184 Winter DeHon
50
Which gives: More usable gates? More gates/unit area?
Caltech CS184 Winter DeHon
51
Gates Required? Depth=3, Depth=2048? Caltech CS184 Winter DeHon
52
Gate metric for FPGAs? Day8: several components for computations
compute element interconnect: space time instructions Not all applications need in same balance Assigning a single “capacity” number to device is an oversimplification Caltech CS184 Winter DeHon
53
MPGA vs. FPGA Adding ~2x Custom/MPGA, Custom/FPGA ~10x MPGA (SOG GA)
5Kl2/gate 35-70% usable (50%) 7-17Kl2/gate net Ratio: (5) Xilinx XC4K 1.25Ml2 /CLB gates (26?) 26-73Kl2/gate net Adding ~2x Custom/MPGA, Custom/FPGA ~10x Caltech CS184 Winter DeHon
54
MPGA vs. FPGA MPGA (SOG GA) Ratio: 1--7 (2.5) Xilinx XC4K l=0.6m
tgd~1ns Ratio: (2.5) Xilinx XC4K l=0.6m 1-7 gates in 7ns 2-3 gates typical Caltech CS184 Winter DeHon
55
Processors vs. FPGAs Caltech CS184 Winter DeHon
56
Processors and FPGAs Caltech CS184 Winter DeHon
57
Component Example Single die in 0.35mm XC4085XL-09 3,136 CLBs 4.6ns
682 Bit Ops/ns Alpha 64b ALUs ns 55.7 Bit Ops/ns [1 “bit op” = 2 gate evaluations] Caltech CS184 Winter DeHon
58
Processors and FPGAs Caltech CS184 Winter DeHon
59
Raw Density Summary Area Area-Time MPGA 2-3x Custom FPGA 5x MPGA
Gate Array 6-10x Custom FPGA 15-20x Gate Array Processor 10x FPGA Caltech CS184 Winter DeHon
60
Raw Density Caveats Processor/FPGA may solve more specialized problem
Problems have different resource balance requirements …can lead to low yield of raw density Caltech CS184 Winter DeHon
61
Degrade from Peak Caltech CS184 Winter DeHon
62
Degrade from Peak: FPGAs
Long path length not run at cycle Limited throughput requirement bottlenecks elsewhere limit throughput req. Insufficient interconnect Insufficient retiming resources (bandwidth) Caltech CS184 Winter DeHon
63
Degrade from Peak: Processors
Ops w/ no gate evaluations (interconnect) Ops use limited word width Stalls waiting for retimed data Caltech CS184 Winter DeHon
64
Degrade from Peak: Custom/MPGA
Solve more general problem than required (more gates than really need) Long path length Limited throughput requirement Not needed or applicable to a problem Caltech CS184 Winter DeHon
65
Degrade Notes We’ll cover these issues in more detail as we get into them later in the course Caltech CS184 Winter DeHon
66
Big Ideas [MSB Ideas] Raw densities: custom:ga:fpga:processor
1:5:100:1000 close gap with specialization Caltech CS184 Winter DeHon
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.