Penn ESE680-002 Spring2007 -- DeHon 1 ESE680-002 (ESE534): Computer Organization Day 23: April 9, 2007 Control.

Slides:



Advertisements
Similar presentations
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.
Advertisements

Implementation Approaches with FPGAs Compile-time reconfiguration (CTR) CTR is a static implementation strategy where each application consists of one.
1/1/ /e/e eindhoven university of technology Microprocessor Design Course 5Z008 Dr.ir. A.C. (Ad) Verschueren Eindhoven University of Technology Section.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Reconfigurable Computing: What, Why, and Implications for Design Automation André DeHon and John Wawrzynek June 23, 1999 BRASS Project University of California.
CS294-6 Reconfigurable Computing Day 5 September 8, 1998 Comparing Computing Devices.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.
CS294-6 Reconfigurable Computing Day 6 September 10, 1998 Comparing Computing Devices.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 15: March 12, 2007 Interconnect 3: Richness.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day10: October 25, 2000 Computing Elements 2: Cascades, ALUs,
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 9: February 7, 2007 Instruction Space Modeling.
Penn ESE Fall DeHon 1 ESE (ESE534): Computer Organization Day 19: March 26, 2007 Retime 1: Transformations.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.
Trends toward Spatial Computing Architectures Dr. André DeHon BRASS Project University of California at Berkeley.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 10: February 12, 2007 Empirical Comparisons.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 18: February 21, 2003 Retiming 2: Structures and Balance.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 11, 2009 Dataflow.
CS294-6 Reconfigurable Computing Day 16 October 15, 1998 Retiming.
CS294-6 Reconfigurable Computing Day 18 October 22, 1998 Control.
CS294-6 Reconfigurable Computing Day 14 October 7/8, 1998 Computing with Lookup Tables.
CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.
CS294-6 Reconfigurable Computing Day 16 October 20, 1998 Retiming Structures.
Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.
ENGIN112 L25: State Reduction and Assignment October 31, 2003 ENGIN 112 Intro to Electrical and Computer Engineering Lecture 25 State Reduction and Assignment.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…
CS294-6 Reconfigurable Computing Day 25 Heterogeneous Systems and Interfacing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 12: February 21, 2007 Compute 2: Cascades, ALUs, PLAs.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 8: February 13, 2008 Retiming.
ESE Spring DeHon 1 ESE534: Computer Organization Day 19: April 7, 2014 Interconnect 5: Meshes.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: March 3, 2014 Instruction Space Modeling 1.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day3:
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 9: February 24, 2014 Operator Sharing, Virtualization, Programmable Architectures.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #4 – FPGA.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day18: November 22, 2000 Control.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: March 3, 2003 Control.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day16: November 15, 2000 Retiming Structures.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 8: January 27, 2003 Empirical Cost Comparisons.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 21: April 4, 2012 Lossless Data Compression.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 31, 2003 Compute 2:
Penn ESE534 Spring DeHon 1 ESE534 Computer Organization Day 9: February 13, 2012 Interconnect Introduction.
Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.
Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 15: March 13, 2013 High Level Synthesis II Dataflow Graph Sharing.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 22: April 16, 2014 Time Multiplexing.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 28, 2005 Empirical Comparisons.
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #22 – Multi-Context.
ESE534: Computer Organization
CS184a: Computer Architecture (Structure and Organization)
ESE534 Computer Organization
CS184a: Computer Architecture (Structure and Organization)
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
CS184a: Computer Architecture (Structure and Organization)
ESE534: Computer Organization
ESE534: Computer Organization
CS184a: Computer Architecture (Structures and Organization)
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
Presentation transcript:

Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 23: April 9, 2007 Control

Penn ESE Spring DeHon 2 Previously Looked broadly at instruction effects Looked at structural components of computation –interconnect –compute –retiming Looked at time-multiplexing

Penn ESE Spring DeHon 3 Today Control –data-dependent operations Different forms –local –instruction selection Architectural Issues

Penn ESE Spring DeHon 4 Control Control: That point where the data affects the instruction stream (operation selection) –Typical manifestation data dependent branching –if (a!=0) OpA else OpB – bne data dependent state transitions –new => goto S0 –else => stay data dependent operation selection +/- addp

Penn ESE Spring DeHon 5 Control Viewpoint: can have instruction stream sequence without control –I.e. static/data-independent progression through sequence of instructions is control free C0  C1  C2  C0  C1  C2  C0  … –Similarly, FSM w/ no data inputs –E.g. Day 5…non-branching datapath

Penn ESE Spring DeHon 6 Our “First” Programmable Architecture Day 5

Penn ESE Spring DeHon 7 Terminology (reminder) Primitive Instruction (pinst) –Collection of bits which tell a bit-processing element what to do –Includes: select compute operation input sources in space (interconnect) input sources in time (retiming) Configuration Context –Collection of all bits (pinsts) which describe machine’s behavior on one cycle

Penn ESE Spring DeHon 8 Why? Why do we need / want control? Static interconnect sufficient? Static sequencing? Static datapath operations?

Penn ESE Spring DeHon 9 Back to “Any” Computation Design must handle all potential inputs (computing scenarios) Requires sufficient generality However, computation for any given input may be much smaller than general case. Instantaneous computation << potential computation

Penn ESE Spring DeHon 10 Screwdriver Analogy Need capability to handle –Slothead –Phillips –Torq –Hex… But only need one at a time…

Penn ESE Spring DeHon 11 Video Decoder E.g. Video decoder [frame rate = 33ms] –if (packet==FRAME) if (type==I-FRAME) –I-FRAME computation else if (type==B-FRAME) –B-FRAME computation

Penn ESE Spring DeHon 12 Packet Processing If IP-V6 packet –…. If IP-V4 packet –… If VoiP packet –… If modem packet –….

Penn ESE Spring DeHon 13 Two Control Options 1.Local control –unify choices build all options into spatial compute structure and select operation 2.Instruction selection –provide a different instruction (instruction sequence) for each option –selection occurs when chose which instruction(s) to issue

Penn ESE Spring DeHon 14 FSM Example (local control) 4 4-LUTs 2 LUT Delays

Penn ESE Spring DeHon 15 FSM Example 3 4-LUTs 1 LUT Delay

Penn ESE Spring DeHon 16 Local Control LUTs used  LUT evaluations produced  Counting LUTs not tell cycle-by-cycle LUT needs

Penn ESE Spring DeHon 17 FSM Example (Instruction) 3 4-LUTs 1 LUT Delay

Penn ESE Spring DeHon 18 Local vs. Instruction If can decide early enough –and afford schedule/reload –instruction select  less computation If load too expensive –local instruction faster maybe even less capacity (AT)

Penn ESE Spring DeHon 19 How show up in modern  P? Instruction? Local control?

Penn ESE Spring DeHon 20 Slow Context Switch Instruction selection profitable only at coarse grain –Xilinx ms reconfiguration times –HSRA  s reconfiguration times still 1000s of cycles E.g. Video decoder [frame rate = 33ms] –if (packet==FRAME) if (type==I-FRAME) –IF-context else if (type==B-FRAME) –BF-context

Penn ESE Spring DeHon 21 Local vs. Instruction For multicontext device –i.e. fast (single) cycle switch –factor according to available contexts For conventional devices –factor only for gross differences –and early binding time

Penn ESE Spring DeHon 22 Optimization Basic Components –T load -- config. time –T select -- case compute time –T gen -- generalized compute time –A select -- case compute area –A gen -- generalized compute area Minimize Capacity Consumed: –AT local = A gen  T gen –AT select = A select  (T select +T load ) T load  0 if can overlap w/ previous operation –know early enough –background load –have sufficient bandwidth to load

Penn ESE Spring DeHon 23 FSM Control Factoring Experiment

Penn ESE Spring DeHon 24 FSM Example FSM -- canonical “control” structure –captures many of these properties –can implement with deep multicontext instruction selection –can implement as multilevel logic unify, use local control Serve to build intuition

Penn ESE Spring DeHon 25 Full Partitioning Experiment Give each state its own context Optimize logic in state separately Tools –mustang, espresso, sis, Chortle Use: –one-hot encodings for single context smallest/fastest –dense for multicontext assume context select needs dense

Penn ESE Spring DeHon 26 Look at Assume stay in context for a number of LUT delays to evaluate logic/next state Pick delay from worst-case Assume single LUT-delay for context selection? – savings of 1 LUT-delay => comparable time Count LUTs in worst-case state

Penn ESE Spring DeHon 27 Full Partition ( Area Target)

Penn ESE Spring DeHon 28 Full Partition ( Delay Target)

Penn ESE Spring DeHon 29 Full Partitioning Full partitioning comes out better –~40% less area Note: full partition may not be optimal area case –e.g. intro example, no reduction in area or time beyond 2-context implementation 4-context (full partition) just more area –(additional contexts)

Penn ESE Spring DeHon 30 Partitioning versus Contexts (Area) CSE benchmark

Penn ESE Spring DeHon 31 Partitioning versus Contexts (Delay) CSE benchmark

Penn ESE Spring DeHon 32 Partitioning versus Contexts (Heuristic) Start with dense mustang state encodings Greedily pick state bit which produces –least greatest area split –least greatest delay split Repeat until have desired number of contexts

Penn ESE Spring DeHon 33 Partition to Fixed Number of Contexts N.B. - more realistic, device has fixed number of contexts.

Penn ESE Spring DeHon 34 Extend Comparison to Memory Fully local => compute with LUTs Fully partitioned => lookup logic (context) in memory and compute logic How compare to fully memory? –Simply lookup result in table?

Penn ESE Spring DeHon 35 Memory FSM Compare (small)

Penn ESE Spring DeHon 36 Memory FSM Compare (large)

Penn ESE Spring DeHon 37 Memory FSM Compare (notes) Memory selected was “optimally” sized to problem –in practice, not get to pick memory allocation/organization for each FSM –no interconnect charged Memory operate in single cycle –but cycle slowing with inputs Smaller for <11 state+input bits Memory size not affected by CAD quality (FPGA/DPGA is)

Penn ESE Spring DeHon 38 Control Granularity

Penn ESE Spring DeHon 39 Control Granularity What if we want to run multiple of these FSMs on the same component? –Local –Instruction

Penn ESE Spring DeHon 40 Consider Two network data ports –states: idle, first-datum, receiving, closing –data arrival uncorrelated between ports

Penn ESE Spring DeHon 41 Local Control Multi-FSM Not rely on instructions Each wired up independently Easy to have multiple FSMs –(units of control)

Penn ESE Spring DeHon 42 Instruction Control If FSMs advance orthogonally –(really independent control) –context depth => product of states for full partition –I.e. w/ single controller (PC) must create product FSM which may lead to state explosion –N FSMs, with S states => S N product states –This example: 4 states, 2 FSMs => 16 state composite FSM

Penn ESE Spring DeHon 43 Architectural Questions How many pinsts/controller? Fixed or Configurable assignment of controllers to pinsts? –…what level of granularity?

Penn ESE Spring DeHon 44 Architectural Questions Effects of: –Too many controllers? –Too few controllers? –Fixed controller assignment? –Configurable controller assignment?

Penn ESE Spring DeHon 45 Architectural Questions Too many: –wasted space on extra controllers –synchronization? Too few: –product state space and/or underuse logic Fixed: –underuse logic if when region too big Configurable: –cost interconnect, slower distribution

Penn ESE Spring DeHon 46 Control and FPGAs Local/single instruction not rely on controller Potential strength of FPGA Easy to breakup capacity and deploy to orthogonal tasks How processor handle orthogonal tasks? –Efficiency?

Penn ESE Spring DeHon 47 Control and FPGAs Data dependent selection –potentially fast w/ local control compared to  P –Can examine many bits and perform multi-way branch (output generation) in just a few LUT cycles –  P requires sequence of operations

Penn ESE Spring DeHon 48 Architecture Instr. Taxonomy

Penn ESE Spring DeHon 49 Admin

Penn ESE Spring DeHon 50 Big Ideas [MSB Ideas] Control: where data effects instructions (operation) Two forms: –local control all ops resident  fast selection –instruction selection may allow us to reduce instantaneous work requirements introduce issues –depth, granularity, instruction load time

Penn ESE Spring DeHon 51 Big Ideas [MSB-1 Ideas] Intuition => looked at canonical FSM case –few context can reduce LUT requirements considerably (factor dissimilar logic) –similar logic more efficient in local control –overall, moderate contexts (e.g. 8) exploits both properties better than extremes –single context (all local control) –full partition –flat memory (except for smallest FSMs)