ESE534: Computer Organization Day 12: March 5, 2014 Instruction Space Modeling 2 Penn ESE534 Spring 2014 -- DeHon
Previously Instruction Space Modeling {Area, Energy} from Architecture Application Structure Mismatch Net area, energy Efficiency Example: Wtask vs. Warch
Today Application vs. Architecture Mismatch Net area, energy Efficiency Path Length vs. Context Depth Impact of multiple dimensions
Context Depth
Two Application Drivers Latency goal Throughput goal E.g. operate in real time
Throughput Goals Examples of application throughput goals we might have?
Throughput Goal Have 1GHz datapath (1 ns operator) Want to compute one result every 10ns How many operations can it perform per bit operator in the 10ns period? Throughput-driven Path Length Path Length: How much sequentialization is allowed?
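The throughput question on this slide is simple arithmetic. A minimal sketch, assuming the throughput-driven path length is just the result period divided by the operator delay (the function name is illustrative, not from the course):

```python
def throughput_path_length(result_period_ns: float, operator_delay_ns: float) -> int:
    """Operator cycles available per result, i.e. how much
    sequentialization the throughput goal allows."""
    return int(result_period_ns // operator_delay_ns)


# 1 GHz datapath (1 ns operator), one result wanted every 10 ns.
print(throughput_path_length(10.0, 1.0))  # 10
```

Under this reading, each bit operator can be reused for 10 operations per result.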
Latency Path Length How many primitive-operator delays before we can perform the next operation? Reuse the resource
Latency Goal Reuse How many times can I reuse each primitive operator without increasing latency? Path Length: How much sequentialization is required?
Critical Path Single cycle multiply and add What is critical path?
s_i = a_i*q_{i-1} + (1-a_i)*x_i
t_i = b_i*r_{i-1} + (1-b_i)*x_i
q_i = q_{i-1} + s_i
r_i = r_{i-1} + t_i
Application Needs Common critical paths?
Examples (columns: Application | Wapp | Lcritpath | Lpath | Notes; some cells were blank on the slide, so each row lists its values in slide order)
Conway LIFE | 1 | Run as fast as possible
ECC | 1-10 | 100 | 100ns memory interface
Video | 8 | 1-6 | 24 | 1GHz for 1024x1024x30 frames/s
Audio | 16 | 20,000 | 44KHz for 1GHz
FDTD | 35 | 1-5
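The Notes column suggests how a throughput-driven path length can be estimated: divide the operator clock rate by the required result rate. A rough sketch of that arithmetic, assuming a 1 GHz operator clock throughout (the helper name is hypothetical, and the slide's own table values may be rounded or derived differently):

```python
CLOCK_HZ = 1e9  # assumed 1 GHz operator clock, as in the Notes column


def cycles_per_result(results_per_second: float, clock_hz: float = CLOCK_HZ) -> float:
    """Operator cycles available between consecutive results."""
    return clock_hz / results_per_second


audio = cycles_per_result(44e3)              # 44 KHz sample rate
video = cycles_per_result(1024 * 1024 * 30)  # 1024x1024 pixels at 30 frames/s
ecc = round(100e-9 * CLOCK_HZ)               # 100 ns memory interface

print(f"audio: about {audio:,.0f} cycles per sample")
print(f"video: about {video:.1f} cycles per pixel")
print(f"ECC:   {ecc} cycles per memory access")
```

The audio figure lands in the low tens of thousands of cycles, the same order as the 20,000 in the table.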
Model Area
Relative Sizes: Bit Operator 4K F²; Bit Operator Interconnect 240K F²; Instruction (w/ interconnect) 20K F²; Memory bit (SRAM) 300 F².
Shown here are the ballpark sizes for each of the major components in these devices. Here, and elsewhere, I'll be talking about feature sizes in normalized units of lambda, where lambda is half the minimum feature size in a VLSI process. This way I can talk about areas without tying my general results to any particular process. Lambda-squared, then, is a normalized measure of area. These numbers come from looking at the composition of VLSI layouts, both constructively and by decomposing existing artifacts. A bit operator, like an ALU bit or an FPGA LUT or even a gate, along with a reasonable slice of interconnect, will take up 500K to 1M lambda-squared. The instruction that controls this compute and interconnect will be about one tenth to one twelfth that size. A single memory bit for holding data will be fairly small, but these do add up when we collect a large number of them together. Shown here is the relative area for these components. Note that the operator itself is generally negligible in area compared to the interconnect. As we've noted, the instruction is an order of magnitude smaller than the interconnect, but a large number of these can easily dominate the interconnect. Similarly, memory bits are small by themselves but can rapidly add up to consume significant amounts of area. So, now let's look at how these areas impact what we can build in fixed area.
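The relative sizes can be checked with a few lines of arithmetic. This sketch uses the slide's figures and assumes F denotes the minimum feature size, so F = 2·lambda and 1 F² = 4 lambda² (the slide does not state this conversion explicitly; the dictionary keys and helper name are illustrative):

```python
# Ballpark areas from the slide, in F^2.
AREA_F2 = {
    "bit_operator": 4_000,
    "interconnect": 240_000,
    "instruction": 20_000,
    "sram_bit": 300,
}

LAMBDA2_PER_F2 = 4  # assumption: F = 2 * lambda


def count_matching(component: str, reference: str = "interconnect") -> float:
    """How many copies of `component` occupy the same area as `reference`."""
    return AREA_F2[reference] / AREA_F2[component]


op_plus_interconnect = (AREA_F2["bit_operator"] + AREA_F2["interconnect"]) * LAMBDA2_PER_F2

print(count_matching("instruction"))  # instructions per interconnect area
print(count_matching("sram_bit"))     # memory bits per interconnect area
print(op_plus_interconnect)           # operator + interconnect, in lambda^2
```

With these numbers, 12 instructions or 800 SRAM bits match the area of one bit operator's interconnect, and operator plus interconnect comes to just under 1M lambda², consistent with the 500K to 1M range quoted in the notes.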
Context (Instruction) Depth
Efficiency with fixed Width (w=1, 16K PEs) [Plot: efficiency vs. Path Length and Context Depth]
Robust Point for C Abitelm = 240K + C*30K Given compute model: HW5.2c What is area efficient robust point for C? HW5.2c “energy competitive” = robust point
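One plausible reading of the robust point is the context depth at which context memory area equals the fixed area, so that no application is served much worse than half efficiency. The sketch below explores that reading under a simplified efficiency model that is an assumption here, not the homework's definition: an application with path length Lpath ideally runs on bit elements with exactly Lpath contexts, and an architecture with C contexts can reuse each element at most min(C, Lpath) times.

```python
FIXED = 240_000       # per-bit-element area independent of C (from the slide)
PER_CONTEXT = 30_000  # area added per context (from the slide)


def bitelm_area(c: int) -> int:
    """A_bitelm = 240K + C*30K."""
    return FIXED + c * PER_CONTEXT


def efficiency(l_path: int, c: int) -> float:
    """Ideal area per operation divided by actual area per operation
    (assumed model: each element is reused min(c, l_path) times)."""
    ideal = bitelm_area(l_path) / l_path
    actual = bitelm_area(c) / min(c, l_path)
    return ideal / actual


balance_c = FIXED // PER_CONTEXT  # context area equals fixed area here

for c in (1, balance_c, 64):
    worst = min(efficiency(l, c) for l in (1, 8, 64, 512, 8000))
    print(f"C={c:3d}: worst-case efficiency {worst:.2f}")
```

Under these assumptions the balance point is C = 8, and its worst-case efficiency stays just above 0.5 across the sampled path lengths, while C = 1 and C = 64 each fall to roughly 0.1 at one end of the range.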
Ideal Efficiency (different model) Two resources here: active processing elements; operation description/state. Applications need these in different proportions. Application Requirement …make sure to define lpath
Technology Change Abitelm = 240K + C*30K What happens to the robust point if we have a technology change that reduces the cost of memory cells? E.g. from 300 F² SRAM cells to 100 F² DRAM cells (logic process)?
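A back-of-the-envelope way to explore this question, assuming (a) the robust point sits where context memory area equals the fixed 240K area, and (b) the 30K per-context cost is dominated by memory cells and so scales linearly with cell area. Both are assumptions made for illustration; the slide leaves the answer open.

```python
def balance_depth(fixed_area: float, per_context_area: float) -> float:
    """Context depth at which context memory area equals the fixed area."""
    return fixed_area / per_context_area


FIXED = 240_000
SRAM_CONTEXT = 30_000                   # per-context area with 300 F^2 SRAM cells
DRAM_CONTEXT = SRAM_CONTEXT * 100 / 300  # assumed linear scaling to 100 F^2 cells

print(balance_depth(FIXED, SRAM_CONTEXT))  # balance depth with SRAM
print(balance_depth(FIXED, DRAM_CONTEXT))  # balance depth with DRAM
```

Under these assumptions the balance depth moves from 8 to 24: cheaper memory cells shift the robust point toward deeper contexts.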
Multiple Parameters
C Robust Point depends on Width
Model Area
Robust Point for C given W Abitelm = 240K + (C/W)*30K Given compute model: What is area efficient robust point for C: When W=8? When W=64?
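With instructions shared across W bits, the per-bit instruction cost becomes (C/W)*30K. A small sketch, again assuming the robust point is where that term equals the fixed 240K area (an illustrative reading, not a stated course result; the function name is hypothetical):

```python
def balance_depth_for_width(w: int, fixed: int = 240_000, per_instr: int = 30_000) -> float:
    """Solve (C / W) * per_instr == fixed for C."""
    return w * fixed / per_instr


for w in (1, 8, 64):
    print(f"W={w:2d}: C = {balance_depth_for_width(w):.0f}")
```

Under this assumption C = 8W, giving C = 64 at W = 8 and C = 512 at W = 64. The W = 8 result matches the w=8, c=64 point used on the Intermediate Architecture slide.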
Robust Point depends on Width
Intermediate Architecture w=8 c=64 16K PEs Hard to be robust across entire space…
Processors and FPGAs (architecture vs. two application axes) Processor: c=d=1024, w=64, k=2 FPGA: c=d=1, w=1, k=4
Caveats Model abstracts away many details that are important: interconnect (days 16-20, 25); control (day 23); specialized functional units (day 15). Applications are a heterogeneous mix of characteristics
Modeling Message Architecture space is huge Easy to be very inefficient Valuable to understand to pick right point Hard to pick one point robust across entire space Why we have so many architectures?
General Message Parameterize architectures Look at continuum: costs, benefits Often have competing effects, which lead to maxima/minima
Big Ideas [MSB Ideas] Instruction organization induces a design space (taxonomy) for programmable architectures Mismatch between arch. structure and application requirements leads to inefficiencies Model to visualize efficiency trends Architecture space is huge: can be very inefficient, need to learn to navigate
Admin Homework Marked 4.1 TODO 4.2, 4.3 HW5.1 Due Today Reading Spring Break