Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS184a: Computer Architecture (Structure and Organization)

Similar presentations


Presentation on theme: "CS184a: Computer Architecture (Structure and Organization)"— Presentation transcript:

1 CS184a: Computer Architecture (Structure and Organization)
Day 9: January 26, 2005 Modeling Instruction Space and Empirical Comparisons Caltech CS184 Winter DeHon

2 Last Time Instruction Requirements Instruction Space
Caltech CS184 Winter DeHon

3 Architecture Instruction Taxonomy
Caltech CS184 Winter DeHon

4 Today Instructions Empirical Data Model Architecture Processors FPGAs
implied costs gross application characteristics Empirical Data Processors FPGAs Custom Gate Array Std. Cell Full Caltech CS184 Winter DeHon

5 Quotes If it can’t be expressed in figures, it is not science; it is opinion Lazarus Long Caltech CS184 Winter DeHon

6 Modeling Why do we model? Caltech CS184 Winter DeHon

7 Motivation Need to understand How costly (big) is a solution
How compare to alternatives Cost and benefit of flexibility Caltech CS184 Winter DeHon

8 What we really want: Complete implementation of our application
For each architectural alternatives In same implementation technology w/ multiple area-time points Caltech CS184 Winter DeHon

9 Reality Seldom get it packaged that nicely Deal with
much work to do so technology keeps moving Deal with estimation from components technology differences few area-time points Caltech CS184 Winter DeHon

10 Modeling Instruction Effects
Restrictions from “ideal” save area Restriction from “ideal” limits usability (yield) of PE Want to understand effects area model utilization/yield model Caltech CS184 Winter DeHon

11 Efficiency/Yield Intuition
What happens when Datapath is too wide? Datapath is too narrow? Instruction memory is too deep? Instruction memory is too shallow? Caltech CS184 Winter DeHon

12 Computing Device Composition Bit Processing elements
Interconnect: space Interconnect: time Instruction Memory Tile together to build device Caltech CS184 Winter DeHon

13 Relative Sizes Bit Operator 10-20Kl2
Bit Operator Interconnect K-1Ml2 Instruction (w/ interconnect) Kl2 Memory bit (SRAM) Kl2 Shown here are the ballpark sizes for each of the major components in these devices. Here, and elsewhere, I’ll be talking about feature sizes in normalized units of lambda, where lambda is half the minimum feature size in a VLSI process. This way I can talk about areas without tying my general results to any particular process. Lambda-squared, then is a normalized measure of area. These numbers are come from looking at the composition in VLSI layouts both constructively and decompositionally looking at existing artifacts. A bit operator, like an ALU bit or an FPGA LUT or even a gate, along with a reasonable slice of interconnect will take up 500K to 1M lambda-squared. The instruction which controls this compute and interconnect will be about one 10th to one 12th that size. A single memory bit for holding data will be fairly small, but these do add up when we collect a large number of them together. Shown here is the relative area for these components. Note that the operator itself is generally negligible in area compared to the interconnect. As we’ve noted the instruction is an order of magnitude smaller than the interconnect, but a large number of these can easily dominate the interconnect. Similarly for memory bits, they’re small by themselves but can rapidly add up to consume significant amounts of area. So, now let’s look at how these areas impact what we can build in fixed area. Caltech CS184 Winter DeHon

14 Model Area Caltech CS184 Winter DeHon

15 Calibrate Model Caltech CS184 Winter DeHon

16 Peak Densities from Model
Only 2 of 4 parameters small slice of space 100 density across Large difference in peak densities large design space! Caltech CS184 Winter DeHon

17 Efficiency What do we want to maximize? Yield Fraction / Area
Useful work per unit silicon (not potential/peak work) Yield Fraction / Area (or minimize (Area/Yield) ) Caltech CS184 Winter DeHon

18 Efficiency For comparison, look at relative efficiency to ideal.
Ideal = architecture exactly matched to application requirements Efficiency = Aideal/Aarch Aarch = Area Op/Yield Caltech CS184 Winter DeHon

19 Efficiency Calculation
Caltech CS184 Winter DeHon

20 Efficiency: Width Mismatch
16K PEs Caltech CS184 Winter DeHon

21 Path Length How many primitive-operator delays before can perform next operation? Reuse the resource Caltech CS184 Winter DeHon

22 Reuse Pipeline and reuse at primitive-operator delay level.
How many times can I reuse each primitive operator? Path Length: How much sequentialization Is allowed (required)? Caltech CS184 Winter DeHon

23 Context Depth Caltech CS184 Winter DeHon

24 Efficiency with fixed Width
Path Length Context Depth w=1, 16K PEs Caltech CS184 Winter DeHon

25 Ideal Efficiency (different model)
Two resources here: active processing elements operation description/state Applications need in different proportions. Application Requirement …make sure to define lpath Caltech CS184 Winter DeHon

26 Robust Point depend on Width
Caltech CS184 Winter DeHon

27 Processors and FPGAs c=d=1024, w=64, k=2 c=d=1, w=1, k=4 “Processor”
Caltech CS184 Winter DeHon

28 Intermediate Architecture
w=8 c=64 16K PEs Hard to be robust across entire space… Caltech CS184 Winter DeHon

29 Caveats Model abstracts away many details which are important
interconnect (day ) control (day 21) specialized functional units (next time) Applications are a heterogeneous mix of characteristics Caltech CS184 Winter DeHon

30 Modeling Message Architecture space is huge
Easy to be very inefficient Hard to pick one point robust across entire space Why we have so many architectures? Caltech CS184 Winter DeHon

31 General Message Parameterize architectures Look at continuum
costs benefits Often have competing effects leads to maxima/minima Caltech CS184 Winter DeHon

32 Admin Going forward and back in lecture slides
Handing back assignments Caltech CS184 Winter DeHon

33 Big Ideas [MSB Ideas] Applications typically have structure
Exploit this structure to reduce resource requirements Architecture is about understanding and exploiting structure and costs to reduce requirements Caltech CS184 Winter DeHon

34 Big Ideas [MSB Ideas] Instruction organization induces a design space (taxonomy) for programmable architectures Arch. structure and application requirements mismatch  inefficiencies Model  visualize efficiency trends Architecture space is huge can be very inefficient need to learn to navigate Caltech CS184 Winter DeHon

35 Empirical Comparisons
Caltech CS184 Winter DeHon

36 Empirical Ground modeling in some concretes Start sorting out
custom vs. configurable spatial configurable vs. temporal Caltech CS184 Winter DeHon

37 Full Custom Get to define all layers Use any geometry you like
Only rules are process design rules CS181 Caltech CS184 Winter DeHon

38 Standard Cell Area inv nand3 inv AOI4 nor3 Inv All cells uniform
height Width of channel determined by routing Cell area Identify the full custom and standard cell regions on 386DX die Caltech CS184 Winter DeHon

39 MPGA Metal Programmable Gate Array Gates pre-placed (poly, diffusion)
Only get to define metal connections Cheap – only have to pay for metal mask(s) Caltech CS184 Winter DeHon

40 MPGA vs. Custom? AMI CICC’83 Toshiba DSP Mosaid RAM GE CICC’86
Std-Cell 0.7 Custom 0.5 Toshiba DSP Custom 0.3 Mosaid RAM Custom 0.2 GE CICC’86 MPGA 1.0 Std-Cell FF/counter 0.7 FullAdder 0.4 RAM 0.2 MPGA = Metal Programmable Gate Array (traditional Gate Array) Caltech CS184 Winter DeHon

41 Metal Programmable Gate Arrays
Caltech CS184 Winter DeHon

42 MPGAs Modern -- “Sea of Gates” yield 35--70% maybe 5kl2/gate ?
(quite a bit of variance) Caltech CS184 Winter DeHon

43 FPGA Table Caltech CS184 Winter DeHon

44 Modern FPGAs 1.25Ml2/LE 1. 5Ml2/4-LUT APEX 20K1500E 52K LEs 0.18mm
24mm  22mm 1.25Ml2/LE XC2V1000 10.44mm x 9.90mm [source: Chipworks] 0.15mm 11,520 4-LUTs 1. 5Ml2/4-LUT [Both also have RAM in cited area] Caltech CS184 Winter DeHon

45 Conventional FPGA Tile
K-LUT (typical k=4) w/ optional output Flip-Flop Caltech CS184 Winter DeHon

46 Toronto FPGA Model Caltech CS184 Winter DeHon

47 How many gates? Caltech CS184 Winter DeHon

48 “gates” in 2-LUT Caltech CS184 Winter DeHon

49 Now how many? Caltech CS184 Winter DeHon

50 Which gives: More usable gates? More gates/unit area?
Caltech CS184 Winter DeHon

51 Gates Required? Depth=3, Depth=2048? Caltech CS184 Winter DeHon

52 Gate metric for FPGAs? Day8: several components for computations
compute element interconnect: space time instructions Not all applications need in same balance Assigning a single “capacity” number to device is an oversimplification Caltech CS184 Winter DeHon

53 MPGA vs. FPGA Adding ~2x Custom/MPGA, Custom/FPGA ~10x MPGA (SOG GA)
5Kl2/gate 35-70% usable (50%) 7-17Kl2/gate net Ratio: (5) Xilinx XC4K 1.25Ml2 /CLB gates (26?) 26-73Kl2/gate net Adding ~2x Custom/MPGA, Custom/FPGA ~10x Caltech CS184 Winter DeHon

54 MPGA vs. FPGA MPGA (SOG GA) Ratio: 1--7 (2.5) Xilinx XC4K l=0.6m
tgd~1ns Ratio: (2.5) Xilinx XC4K l=0.6m 1-7 gates in 7ns 2-3 gates typical Caltech CS184 Winter DeHon

55 Processors vs. FPGAs Caltech CS184 Winter DeHon

56 Processors and FPGAs Caltech CS184 Winter DeHon

57 Component Example Single die in 0.35mm XC4085XL-09 3,136 CLBs 4.6ns
682 Bit Ops/ns Alpha 64b ALUs ns 55.7 Bit Ops/ns [1 “bit op” = 2 gate evaluations] Caltech CS184 Winter DeHon

58 Processors and FPGAs Caltech CS184 Winter DeHon

59 Raw Density Summary Area Area-Time MPGA 2-3x Custom FPGA 5x MPGA
Gate Array 6-10x Custom FPGA 15-20x Gate Array Processor 10x FPGA Caltech CS184 Winter DeHon

60 Raw Density Caveats Processor/FPGA may solve more specialized problem
Problems have different resource balance requirements …can lead to low yield of raw density Caltech CS184 Winter DeHon

61 Degrade from Peak Caltech CS184 Winter DeHon

62 Degrade from Peak: FPGAs
Long path length  not run at cycle Limited throughput requirement bottlenecks elsewhere limit throughput req. Insufficient interconnect Insufficient retiming resources (bandwidth) Caltech CS184 Winter DeHon

63 Degrade from Peak: Processors
Ops w/ no gate evaluations (interconnect) Ops use limited word width Stalls waiting for retimed data Caltech CS184 Winter DeHon

64 Degrade from Peak: Custom/MPGA
Solve more general problem than required (more gates than really need) Long path length Limited throughput requirement Not needed or applicable to a problem Caltech CS184 Winter DeHon

65 Degrade Notes We’ll cover these issues in more detail as we get into them later in the course Caltech CS184 Winter DeHon

66 Big Ideas [MSB Ideas] Raw densities: custom:ga:fpga:processor
1:5:100:1000 close gap with specialization Caltech CS184 Winter DeHon


Download ppt "CS184a: Computer Architecture (Structure and Organization)"

Similar presentations


Ads by Google