ESE534: Computer Organization

Slides:

Advertisements

Similar presentations

Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.

Advertisements

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.

Lecture 2: Field Programmable Gate Arrays I September 5, 2013 ECE 636 Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays I.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.

The Spartan 3e FPGA. CS/EE 3710 The Spartan 3e FPGA  What’s inside the chip? How does it implement random logic? What other features can you use?  What.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 4: January 22, 2007 Memories.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 18: February 21, 2003 Retiming 2: Structures and Balance.

CS294-6 Reconfigurable Computing Day 16 October 20, 1998 Retiming Structures.

Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.

ESE Spring DeHon 1 ESE534: Computer Organization Day 19: April 7, 2014 Interconnect 5: Meshes.

System Arch 2008 (Fire Tom Wada) /10/9 Field Programmable Gate Array.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 7: February 6, 2012 Memories.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 5: February 1, 2010 Memories.

Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day16: November 15, 2000 Retiming Structures.

Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 8: February 19, 2014 Memories.

Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: February 27, 2005 Retiming 2: Structures and Balance.

Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 25: April 17, 2013 Covering and Retiming.

Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 21: April 12, 2010 Retiming.

ESE532: System-on-a-Chip Architecture

ESE534: Computer Organization

ESE534: Computer Organization

ESE532: System-on-a-Chip Architecture

Topics SRAM-based FPGA fabrics: Xilinx. Altera..

ESE534: Computer Organization

ESE534: Computer Organization

ESE532: System-on-a-Chip Architecture

Head-to-Head Xilinx Virtex-II Pro Altera Stratix 1.5v 130nm copper

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

ESE532: System-on-a-Chip Architecture

CS184a: Computer Architecture (Structure and Organization)

Instructor: Dr. Phillip Jones

Morgan Kaufmann Publishers Memory & Cache

Spartan FPGAs مرتضي صاحب الزماني.

Hakim Weatherspoon CS 3410 Computer Science Cornell University

Day 26: November 11, 2011 Memory Overview

ESE534: Computer Organization

Field Programmable Gate Array

Field Programmable Gate Array

Field Programmable Gate Array

We will be studying the architecture of XC3000.

The Xilinx Virtex Series FPGA

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

Day 26: November 1, 2013 Synchronous Circuits

Day 29: November 11, 2013 Memory Core: Part 1

ESE534: Computer Organization

Day 30: November 13, 2013 Memory Core: Part 2

ESE532: System-on-a-Chip Architecture

CS184a: Computer Architecture (Structures and Organization)

The Xilinx Virtex Series FPGA

ESE534: Computer Organization

Sequential Logic.

ESE534: Computer Organization

ESE534: Computer Organization

ESE534: Computer Organization

CS184a: Computer Architecture (Structure and Organization)

Day 25: November 8, 2010 Memory Core

4-Bit Register Built using D flip-flops:

ESE532: System-on-a-Chip Architecture

Reconfigurable Computing (EN2911X, Fall07)

Presentation transcript:

ESE534: Computer Organization Day 20: April 2, 2012 Retiming Penn ESE534 Spring2012 -- DeHon

Today Retiming Demand Retiming Supply Folded Computation Logical Pipelining Physical Pipelining Retiming Supply Technology Structures Hierarchy Penn ESE534 Spring2012 -- DeHon

Retiming Demand Penn ESE534 Spring2012 -- DeHon

Image Processing Many operations can be described in terms of 2D window filters Compute value as a weighted sum of the neighborhood Blurring, edge and feature detection, object recognition and tracking, motion estimation, VLSI design rule checking Penn ESE534 Spring2012 -- DeHon

Preclass 2 Describes window computation: 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2012 -- DeHon

Preclass 2 How many times is each in[x][y] used? 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2012 -- DeHon

Preclass 2 Sequentialized on one multiplier, 1: for x=0 to N-1 Distance between in[x][y] uses? 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2012 -- DeHon

Window Penn ESE534 Spring2012 -- DeHon

Parallel Scaling/Spatial Sum Compute entire window at once More hardware, fewer cycles Penn ESE534 Spring2012 -- DeHon

Preclass 2 Unroll inner loop, 1: for x=0 to N-1 2: for y=0 to N-1 Distance between in[x][y] uses? 1: for x=0 to N-1 2: for y=0 to N-1 3: out[x][y]=0; 4: for wx=0 to W-1 5: for wy=0 to W-1 6: out[x][y]+=in[x+wx][y+wy]*c[wx][wy] Penn ESE534 Spring2012 -- DeHon

Fully Spatial What if we gave each pixel its own processor? Penn ESE534 Spring2012 -- DeHon

Preclass 1 How many registers on each link? Penn ESE534 Spring2012 -- DeHon

Flop Experiment #1 Pipeline/C-slow/retime to single LUT delay per cycle MCNC benchmarks to 256 4-LUTs no interconnect accounting average 1.7 registers/LUT (some circuits 2--7) Penn ESE534 Spring2012 -- DeHon

Long Interconnect Path What happens if one of these links ends up on a long interconnect path? Penn ESE534 Spring2012 -- DeHon

Pipeline Interconnect Path To avoid cycle being limited by longest interconnect Pipeline network Penn ESE534 Spring2012 -- DeHon

Chips >> Cycles Chips growing Gate delays shrinking Wire delays aren’t scaling down Will take many cycles to cross chip Penn ESE534 Spring2012 -- DeHon

Clock Cycle Radius Radius of logic can reach in one cycle (45 nm) Radius 10 (preclass 19: Lseg=5  50ps) Few hundred PEs Chip side 600-700 PE 400-500 thousand PEs 100s of cycles to cross Penn ESE534 Spring2012 -- DeHon

Pipelined Interconnect In what cases is this convenient? Penn ESE534 Spring2012 -- DeHon

Pipelined Interconnect When might it pose a challenge? Penn ESE534 Spring2012 -- DeHon

Long Interconnect Path What happens here? C A C B Penn ESE534 Spring2012 -- DeHon

Long Interconnect Path Adds pipeline delays May force further pipelining of logic to balance out paths More registers C A C B Penn ESE534 Spring2012 -- DeHon

Reminder Flop Experiment #1 Pipeline/C-slow/retime to single LUT delay per cycle MCNC benchmarks to 256 4-LUTs no interconnect accounting average 1.7 registers/LUT (some circuits 2--7) Penn ESE534 Spring2012 -- DeHon

Flop Experiment #2 Pipeline and retime to HSRA cycle place on HSRA single LUT or interconnect timing domain same MCNC benchmarks average 4.7 registers/LUT [Tsu et al., FPGA 1999] Penn ESE534 Spring2012 -- DeHon

Retiming Requirements Retiming requirement depends on parallelism and performance Even with a given amount of parallelism Will have a distribution of retiming requirements May differ from task to task May vary independently from compute/interconnect requirements Another balance issue to watch Balance with compute, interconnect Need a canonical way to measure Like Rent? Penn ESE534 Spring2012 -- DeHon

Retiming Supply Penn ESE534 Spring2012 -- DeHon

Optional Output Flip-flop (optionally) on output flip-flop: 1K F2 Switch to select: ~ 1.25K F2 Area: 1 LUT (200K250K F2/LUT) Bandwidth: 1b/cycle Penn ESE534 Spring2012 -- DeHon

Output Single Output Multiple Output Ok, if don’t need other timings of signal Multiple Output more routing Penn ESE534 Spring2012 -- DeHon

Input More registers (K) No more interconnect than unretimed 2.5K F2/register+mux 4-LUT => 10K F2/depth No more interconnect than unretimed open: compare savings to additional reg. cost Area: 1 LUT (250K+d*10K F2) get Kd regs d=4, 290K F2 Bandwidth: K/cycle 1/d th capacity Penn ESE534 Spring2012 -- DeHon

Preclass 3 Penn ESE534 Spring2012 -- DeHon

Some Numbers (memory) Unit of area = F2 (F=2l) Day 7 Some Numbers (memory) Unit of area = F2 (F=2l) Register as stand-alone element  1000 F2 e.g. as needed/used Day 4 Static RAM cell  250 F2 SRAM Memory (single ported) Dynamic RAM cell (DRAM process)  25 F2 Dynamic RAM cell (SRAM process)  75 F2 Penn ESE534 Spring2012 -- DeHon

Retiming Density LUT+interconnect  250K F2 Register as stand-alone element  1K F2 Static RAM cell  250 F2 SRAM Memory (single ported) Dynamic RAM cell (DRAM process)  25 F2 Dynamic RAM cell (SRAM process)  75 F2 Can have much more retiming memory per chip if put it in large arrays …but then cannot get to it as frequently Penn ESE534 Spring2012 -- DeHon

Retiming Structure Concerns Area: F2/bit Throughput: bandwidth (bits/time) Energy Penn ESE534 Spring2012 -- DeHon

Just Logic Blocks Most primitive build flip-flop out of logic blocks I D*/Clk + I*Clk Q Q*/Clk + I*Clk Area: 2 LUTs (200K250K F2/LUT each) Bandwidth: 1b/cycle Penn ESE534 Spring2012 -- DeHon

Separate Flip-Flops Network flip flop w/ own interconnect can deploy where needed requires more interconnect Vary LUT/FF ratio Arch. Parameter Assume routing  inputs 1/4 size of LUT Area: 50K F2 each Bandwidth: 1b/cycle Penn ESE534 Spring2012 -- DeHon

Virtex SRL16 Xilinx Virtex 4-LUT Area: ~250K F2/1616K F2/bit Use as 16b shiftreg Area: ~250K F2/1616K F2/bit Does not need CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) 1/16 th capacity Penn ESE534 Spring2012 -- DeHon

Register File Memory Bank From MIPS-X 250F2/bit + 125F2/port Area(RF) = (d+6)(W+6)(250F2+ports* 125F2) Penn ESE534 Spring2012 -- DeHon

Preclass 4 Complete Table How small can get? Compare w=1, p=8 case to input retiming Penn ESE534 Spring2012 -- DeHon

Input More registers (K) No more interconnect than unretimed 2.5K F2/register+mux 4-LUT => 10K F2/depth No more interconnect than unretimed open: compare savings to additional reg. cost Area: 1 LUT (1M+d*10K F2) get Kd regs d=4, 290K F2 Bandwidth: K/cycle 1/d th capacity Penn ESE534 Spring2012 -- DeHon

Preclass 4 Note compactness from wide words (share decoder) Penn ESE534 Spring2012 -- DeHon

Xilinx CLB Xilinx 4K CLB Area: 1/2 CLB (160K F2)/1610K F2/bit as memory works like RF Area: 1/2 CLB (160K F2)/1610K F2/bit but need 4 CLBs to control Bandwidth: 1b/2 cycle (1/2 CLB) 1/16 th capacity Penn ESE534 Spring2012 -- DeHon

Memory Blocks SRAM bit  300 F2 (large arrays) DRAM bit  25 F2 (large arrays) Bandwidth: W bits / 2 cycles usually single read/write 1/2A th capacity Penn ESE534 Spring2012 -- DeHon

Dual-Ported Block RAMs Virtex-6 Series 36Kb memories Stratix-4 640b, 9Kb, 144Kb Can put 250K/2501K bits in space of 4-LUT Trade few 4-LUTs for considerable memory Penn ESE534 Spring2012 -- DeHon

Dual-Ported Block RAMs Virtex-6 Series 36Kb memories Stratix-4 640b, 9Kb, 144Kb Can put 250K/2501K bits in space of 4-LUT Trade few 4-LUTs for considerable memory Penn ESE534 Spring2012 -- DeHon

Hierarchy/Structure Summary Big Idea: “Memory Hierarchy” arises from area/bandwidth tradeoffs Smaller/cheaper to store words/blocks (saves routing and control) Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect) High bandwidth out of shallow memories Applications have mix of retiming needs Penn ESE534 Spring2012 -- DeHon

(Area, BW)  Hierarchy Area (F2) Bw/capacity FF/LUT 250K 1/1 netFF 50K XC 10K 1/16 RFx1 1/100 FF/RF 1K RF bit 2K SRAM 300 1/105 DRAM 25 1/107 Penn ESE534 Spring2012 -- DeHon

Modern FPGAs Output Flop (depth 1) Use LUT as Shift Register (16) Embedded RAMs (9Kb,36Kb) Interface off-chip DRAM (~0.1—1Gb) No retiming in interconnect ….yet Penn ESE534 Spring2012 -- DeHon

Modern Processors DSPs have accumulator (depth 1) Inter-stage pipelines (depth 1) Lots of pipelining in memory path… Reorder Buffer (4—32) Architected RF (16, 32, 128) Actual RF (256, 512…) L1 Cache (~64Kb) L2 Cache (~1Mb) L3 Cache (10-100Mb) Main Memory in DRAM (~10-100Gb) Penn ESE534 Spring2012 -- DeHon

Admin Final out now Wednesday Lossless compression? 1 month exercise Milestone deadlines next two Mondays Wednesday Lossless compression? Final assignment + cited papers for Wed? OR Reading for Wednesday on web? Weigh in on feedback sheet Penn ESE534 Spring2012 -- DeHon

Big Ideas [MSB Ideas] Tasks have a wide variety of retiming distances (depths) Within design, among tasks Retiming requirements vary independently of compute, interconnect requirements (balance) Wide variety of retiming costs 25 F2250K F2 Routing and I/O bandwidth big factors in costs Gives rise to memory (retiming) hierarchy Penn ESE534 Spring2012 -- DeHon