ESE534: Computer Organization

Slides:



Advertisements
Similar presentations
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Advertisements

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
The Spartan 3e FPGA. CS/EE 3710 The Spartan 3e FPGA  What’s inside the chip? How does it implement random logic? What other features can you use?  What.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 4: January 22, 2007 Memories.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 18: February 21, 2003 Retiming 2: Structures and Balance.
CS294-6 Reconfigurable Computing Day 16 October 20, 1998 Retiming Structures.
Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: March 3, 2014 Instruction Space Modeling 1.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 7: February 6, 2012 Memories.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 5: February 1, 2010 Memories.
ESS | FPGA for Dummies | | Maurizio Donna FPGA for Dummies Basic FPGA architecture.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day16: November 15, 2000 Retiming Structures.
Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 8: February 19, 2014 Memories.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: February 27, 2005 Retiming 2: Structures and Balance.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 21: April 12, 2010 Retiming.
ESE532: System-on-a-Chip Architecture
ESE534: Computer Organization
Memories.
Computer Organization and Architecture + Networks
ESE532: System-on-a-Chip Architecture
Reconfigurable Architectures
ESE534: Computer Organization
Topics SRAM-based FPGA fabrics: Xilinx. Altera..
ESE534: Computer Organization
ESE534: Computer Organization
Random access memory Sequential circuits all depend upon the presence of memory. A flip-flop can store one bit of information. A register can store a single.
ESE532: System-on-a-Chip Architecture
Head-to-Head Xilinx Virtex-II Pro Altera Stratix 1.5v 130nm copper
ESE534 Computer Organization
ESE532: System-on-a-Chip Architecture
Random access memory Sequential circuits all depend upon the presence of memory. A flip-flop can store one bit of information. A register can store a single.
ESE532: System-on-a-Chip Architecture
CS184a: Computer Architecture (Structure and Organization)
Instructor: Dr. Phillip Jones
Architecture & Organization 1
Morgan Kaufmann Publishers Memory & Cache
Day 26: November 11, 2011 Memory Overview
ESE534: Computer Organization
We will be studying the architecture of XC3000.
ESE534: Computer Organization
The Xilinx Virtex Series FPGA
Architecture & Organization 1
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
Day 30: November 13, 2013 Memory Core: Part 2
ESE534: Computer Organization
Recall: ROM example Here are three functions, V2V1V0, implemented with an 8 x 3 ROM. Blue crosses (X) indicate connections between decoder outputs and.
Computer Evolution and Performance
The Xilinx Virtex Series FPGA
CS184a: Computer Architecture (Structure and Organization)
ESE534: Computer Organization
Random access memory Sequential circuits all depend upon the presence of memory. A flip-flop can store one bit of information. A register can store a single.
ESE534: Computer Organization
ESE534: Computer Organization
ESE535: Electronic Design Automation
ESE534: Computer Organization
CS184a: Computer Architecture (Structure and Organization)
4-Bit Register Built using D flip-flops:
ESE532: System-on-a-Chip Architecture
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

ESE534: Computer Organization Day 22: November 16, 2016 Retiming Penn ESE534 Fall 2016 -- DeHon

Day 21: Retiming Requirements Retiming requirement depends on parallelism and performance Even with a given amount of parallelism Will have a distribution of retiming requirements May differ from task to task May vary independently from compute/interconnect requirements Another balance issue to watch Balance with compute, interconnect Need a canonical way to measure Like Rent? Penn ESE534 Fall 2016 -- DeHon

Today Retiming Supply Technology Structures Hierarchy …or, how do we add memory (state) to architectures Penn ESE534 Fall 2016 -- DeHon

Relative Sizes Bit Operator 3-5KF2 Bit Operator Interconnect 200K-250KF2 Instruction (w/ interconnect) 20KF2 Memory bit (SRAM) 250-500F2 Memory bit (DRAM) 25F2 Flip-Flop 1000F2 Shown here are the ballpark sizes for each of the major components in these devices. Here, and elsewhere, I’ll be talking about feature sizes in normalized units of lambda, where lambda is half the minimum feature size in a VLSI process. This way I can talk about areas without tying my general results to any particular process. Lambda-squared, then is a normalized measure of area. These numbers are come from looking at the composition in VLSI layouts both constructively and decompositionally looking at existing artifacts. A bit operator, like an ALU bit or an FPGA LUT or even a gate, along with a reasonable slice of interconnect will take up 500K to 1M lambda-squared. The instruction which controls this compute and interconnect will be about one 10th to one 12th that size. A single memory bit for holding data will be fairly small, but these do add up when we collect a large number of them together. Shown here is the relative area for these components. Note that the operator itself is generally negligible in area compared to the interconnect. As we’ve noted the instruction is an order of magnitude smaller than the interconnect, but a large number of these can easily dominate the interconnect. Similarly for memory bits, they’re small by themselves but can rapidly add up to consume significant amounts of area. So, now let’s look at how these areas impact what we can build in fixed area. Penn ESE534 Fall 2016 -- DeHon

State Bit Operator 3-5KF2 Bit Operator Interconnect 200K-250KF2 Instruction (w/ interconnect) 20KF2 Memory bit (SRAM) 250-500F2 Memory bit (DRAM) 25F2 Flip-Flop 1000F2 Shown here are the ballpark sizes for each of the major components in these devices. Here, and elsewhere, I’ll be talking about feature sizes in normalized units of lambda, where lambda is half the minimum feature size in a VLSI process. This way I can talk about areas without tying my general results to any particular process. Lambda-squared, then is a normalized measure of area. These numbers are come from looking at the composition in VLSI layouts both constructively and decompositionally looking at existing artifacts. A bit operator, like an ALU bit or an FPGA LUT or even a gate, along with a reasonable slice of interconnect will take up 500K to 1M lambda-squared. The instruction which controls this compute and interconnect will be about one 10th to one 12th that size. A single memory bit for holding data will be fairly small, but these do add up when we collect a large number of them together. Shown here is the relative area for these components. Note that the operator itself is generally negligible in area compared to the interconnect. As we’ve noted the instruction is an order of magnitude smaller than the interconnect, but a large number of these can easily dominate the interconnect. Similarly for memory bits, they’re small by themselves but can rapidly add up to consume significant amounts of area. So, now let’s look at how these areas impact what we can build in fixed area. A(state bit) << A(bit-processing element) Penn ESE534 Fall 2016 -- DeHon

State Size A(state bit) << A(bit-processing element) Enabler for time-space tradeoffs Balance: can afford many bits per bit processing element 250K/1K = 250 (flip-flops) 250K/250=1K (SRAM bits) A(state bit) << A(bit-processing element) Penn ESE534 Fall 2016 -- DeHon

State Density Memory is most dense in large arrays What’s expensive Also slow, low bandwidth What’s expensive I/O Routing Bandwidth to access memory Penn ESE534 Fall 2016 -- DeHon

Reuse Distance Interpretation How long between when something is produced and when it is consumed? FSM state Produced every cycle/consumed every cycle Line buffers for video When retiming long delay Ratio memory/IO is high Can afford/exploit large memory Penn ESE534 Fall 2016 -- DeHon

Retiming Structure Concerns Area: F2/bit Throughput: bandwidth (bits/time) Energy Penn ESE534 Fall 2016 -- DeHon

Optional Output Flip-flop (optionally) on output flip-flop: 1K F2 Switch to select: ~ 1.25K F2 Area: 1 LUT ~ 250K F2/LUT Bandwidth: 1b/cycle Penn ESE534 Fall 2016 -- DeHon

Output Single Output Multiple Output Ok, if don’t need other timings of signal Multiple Output more routing Penn ESE534 Fall 2016 -- DeHon

Input More registers (K) No more interconnect than unretimed 2.5K F2/register+mux 4-LUT => 10K F2/depth No more interconnect than unretimed open: compare savings to additional reg. cost Area: 1 LUT (250K+d*10K F2) get Kd regs d=4, 290K F2 Bandwidth: K/cycle 1/d th capacity Penn ESE534 Fall 2016 -- DeHon

Preclass 1 Compare LUT sizing, interconnect p. Penn ESE534 Fall 2016 -- DeHon

Just Logic Blocks Most primitive build flip-flop out of logic blocks I D*/Clk + I*Clk Q Q*/Clk + I*Clk Area: 2 LUTs (250K F2/LUT each) Bandwidth: 1b/cycle Penn ESE534 Fall 2016 -- DeHon

Separate Flip-Flops Network flip flop w/ own interconnect can deploy where needed requires more interconnect Vary LUT/FF ratio Arch. Parameter Assume routing  inputs 1/4 size of LUT Area: 50K F2 each Bandwidth: 1b/cycle Penn ESE534 Fall 2016 -- DeHon

Virtex SRL16 Xilinx Virtex 4-LUT Area: ~250K F2/1616K F2/bit Use as 16b shiftreg Area: ~250K F2/1616K F2/bit Does not need CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) 1/16 th capacity Penn ESE534 Fall 2016 -- DeHon

Register File Memory Bank From MIPS-X 250F2/bit + 125F2/port Area(RF) = (d+6)(W+6)(250F2+ports* 125F2) Penn ESE534 Fall 2016 -- DeHon

Preclass 2 Complete Table How small can get? Compare w=1, d=16, ports=4 case to input retiming Penn ESE534 Fall 2016 -- DeHon

Input More registers (K) No more interconnect than unretimed 2.5K F2/register+mux 4-LUT => 10K F2/depth No more interconnect than unretimed open: compare savings to additional reg. cost Area: 1 LUT (1M+d*10K F2) get Kd regs d=4, 290K F2 Bandwidth: K/cycle 1/d th capacity Penn ESE534 Fall 2016 -- DeHon

Preclass 2 Note compactness from wide words (share decoder) Penn ESE534 Fall 2016 -- DeHon

Xilinx CLB Xilinx 4K CLB Area: 1/2 CLB (160K F2)/1610K F2/bit as memory works like RF Area: 1/2 CLB (160K F2)/1610K F2/bit but need 4 CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) 1/16 th capacity Penn ESE534 Fall 2016 -- DeHon

Memory Blocks SRAM bit  300 F2 (large arrays) DRAM bit  25 F2 (large arrays) Bandwidth: W bits / 2 cycles usually single read/write 1/2A th capacity Penn ESE534 Fall 2016 -- DeHon

Dual-Ported Block RAMs Virtex-6 Series 36Kb memories Stratix-4 640b, 9Kb, 144Kb Stratix-5 20Kb Can put 250K/2501K bits in space of 4-LUT Trade few 4-LUTs for considerable memory Penn ESE534 Fall 2016 -- DeHon

Dual-Ported Block RAMs Virtex-6 Series 36Kb memories Stratix-4 640b, 9Kb, 144Kb Stratix-5 20Kb Can put 250K/2501K bits in space of 4-LUT Trade few 4-LUTs for considerable memory Penn ESE534 Fall 2016 -- DeHon

Hierarchy/Structure Summary Big Idea: “Memory Hierarchy” arises from area/bandwidth tradeoffs Smaller/cheaper to store words/blocks (saves routing and control) Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect) High bandwidth out of shallow memories Lower energy out of small memories Applications have mix of retiming needs Penn ESE534 Fall 2016 -- DeHon

(Area, BW)  Hierarchy Area (F2) Bw/capacity FF/LUT 250K 1/1 netFF 50K XC 10K 1/16 RFx1 1/100 FF/RF 1K RF bit 2K SRAM 300 1/105 DRAM 25 1/107 Penn ESE534 Fall 2016 -- DeHon

Clock Cycle Radius Radius of logic can reach in one cycle (45 nm) Radius 10 (preclass 20: Lseg=5  50ps) Few hundred PEs Chip side 600-700 PE 400-500 thousand PEs 100s of cycles to cross Penn ESE534 Fall 2016 -- DeHon

Clock Cycle Radius Radius of logic memory can reach in one cycle (45 nm) Radius 10 Chip side 600-700 PE 400-500 thousand PEs 100s of cycles to cross Can only reach a small amount of data quickly More state  slower access Penn ESE534 Fall 2016 -- DeHon

Capacity vs. Delay, Energy How many hops to 3, 15, 31, 63? What fraction of memory can reach in same hops as Energy to access 63, 31, 15 compared to 3? Penn ESE534 Fall 2016 -- DeHon

Capacity vs. Delay, Energy Can only place a few things close Slower to access far things More energy to access far things More energy to select from large number of things Penn ESE534 Fall 2016 -- DeHon

Modern FPGAs Output Flop (depth 1) Use LUT as Shift Register (16,32) Embedded RAMs (9Kb,20Kb,36Kb) Larger chip RAMs (X UltraRAM 100Mbs) Interface off-chip DRAM (Gbits) Retiming in interconnect (Stratix 10) Penn ESE534 Fall 2016 -- DeHon

Modern Processors DSPs have accumulator (depth 1) Inter-stage pipelines (depth 1) Lots of pipelining in memory path… Reorder Buffer (4—32) Architected RF (16, 32, 128) Actual RF (256, 512…) L1 Cache (~64Kb) L2 Cache (~1Mb) L3 Cache (10-100Mb) Main Memory in DRAM (100Gb—1Tbs) Penn ESE534 Fall 2016 -- DeHon

Big Ideas [MSB Ideas] Tasks have a wide variety of retiming distances (depths) Within design, among tasks Retiming requirements vary independently of compute, interconnect requirements (balance) Wide variety of retiming costs 25 F2250K F2 Routing and I/O bandwidth big factors in costs Gives rise to memory (retiming) hierarchy Penn ESE534 Fall 2016 -- DeHon

Admin HW9 due today Final out now 1 month exercise Milestone deadlines next two Wednesdays (Wed. before Thanksgiving is not a Wednesday) Penn ESE534 Fall 2016 -- DeHon