CS184a: Computer Architecture (Structure and Organization)

Slides:



Advertisements
Similar presentations
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.
Advertisements

Implementation Approaches with FPGAs Compile-time reconfiguration (CTR) CTR is a static implementation strategy where each application consists of one.
1/1/ /e/e eindhoven university of technology Microprocessor Design Course 5Z008 Dr.ir. A.C. (Ad) Verschueren Eindhoven University of Technology Section.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.
CS294-6 Reconfigurable Computing Day 6 September 10, 1998 Comparing Computing Devices.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day10: October 25, 2000 Computing Elements 2: Cascades, ALUs,
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 9: February 7, 2007 Instruction Space Modeling.
Penn ESE Fall DeHon 1 ESE (ESE534): Computer Organization Day 19: March 26, 2007 Retime 1: Transformations.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.
CS294-6 Reconfigurable Computing Day 18 October 22, 1998 Control.
CS294-6 Reconfigurable Computing Day 14 October 7/8, 1998 Computing with Lookup Tables.
CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 23: April 9, 2007 Control.
Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…
CS294-6 Reconfigurable Computing Day 25 Heterogeneous Systems and Interfacing.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: March 3, 2014 Instruction Space Modeling 1.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day3:
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CALTECH CS137 Winter DeHon CS137: Electronic Design Automation Day 7: February 3, 2002 Retiming.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day18: November 22, 2000 Control.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: March 3, 2003 Control.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 13: February 6, 2003 Interconnect 3: Richness.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 8: January 27, 2003 Empirical Cost Comparisons.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 31, 2003 Compute 2:
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 5: January 17, 2003 What is required for Computation?
Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.
Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
CBSSS 2002: DeHon Universal Programming Organization André DeHon Tuesday, June 18, 2002.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 22: April 16, 2014 Time Multiplexing.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 10: January 28, 2005 Empirical Comparisons.
ESE532: System-on-a-Chip Architecture
Computer Organization
ESE534: Computer Organization
ESE532: System-on-a-Chip Architecture
CS184a: Computer Architecture (Structure and Organization)
Introduction Introduction to VHDL Entities Signals Data & Scalar Types
Lecture 27 Logistics Last lecture Today: HW8 due Friday
ESE534 Computer Organization
CS137: Electronic Design Automation
ESE532: System-on-a-Chip Architecture
CS184a: Computer Architecture (Structure and Organization)
Lecture 27 Logistics Last lecture Today: HW8 due Friday
ESE534: Computer Organization
ESE535: Electronic Design Automation
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
CSE 370 – Winter 2002 – Comb. Logic building blocks - 1
ESE534: Computer Organization
ESE534: Computer Organization
CSE 370 – Winter Sequential Logic-2 - 1
ESE534: Computer Organization
CS184a: Computer Architecture (Structures and Organization)
ESE535: Electronic Design Automation
CS184a: Computer Architecture (Structures and Organization)
ESE534: Computer Organization
ESE534: Computer Organization
ESE535: Electronic Design Automation
ESE534: Computer Organization
ESE534: Computer Organization
Lecture 23 Logistics Last lecture Today HW8 due Wednesday, March 11
CS184a: Computer Architecture (Structure and Organization)
CSE 370 – Winter Sequential Logic-2 - 1
Presentation transcript:

CS184a: Computer Architecture (Structure and Organization) Day 22: March 4, 2005 Control Caltech CS184 Winter2005 -- DeHon

Previously Looked broadly at instruction effects Looked at structural components of computation interconnect compute retiming Looked at time-multiplexing Caltech CS184 Winter2005 -- DeHon

Today Control Different forms Architectural Issues data-dependent operations Different forms local instruction selection Architectural Issues Caltech CS184 Winter2005 -- DeHon

Control Control: That point where the data affects the instruction stream (operation selection) Typical manifestation data dependent branching if (a!=0) OpA else OpB bne data dependent state transitions new => goto S0 else => stay data dependent operation selection +/- addp Caltech CS184 Winter2005 -- DeHon

Control Viewpoint: can have instruction stream sequence without control I.e. static/data-independent progression through sequence of instructions is control free C0C1C2C0C1C2C0 … Similarly, FSM w/ no data inputs E.g. Day 5…non-branching datapath Caltech CS184 Winter2005 -- DeHon

Our “First” Programmable Architecture Day 5 Our “First” Programmable Architecture Caltech CS184 Winter2005 -- DeHon

Terminology (reminder) Primitive Instruction (pinst) Collection of bits which tell a bit-processing element what to do Includes: select compute operation input sources in space (interconnect) input sources in time (retiming) Configuration Context Collection of all bits (pinsts) which describe machine’s behavior on one cycle Caltech CS184 Winter2005 -- DeHon

Why? Why do we need / want control? Static interconnect sufficient? Static sequencing? Static datapath operations? Caltech CS184 Winter2005 -- DeHon

Back to “Any” Computation Design must handle all potential inputs (computing scenarios) Requires sufficient generality However, computation for any given input may be much smaller than general case. Instantaneous computation << potential computation Caltech CS184 Winter2005 -- DeHon

Screwdriver Analogy Need capability to handle Slothead Phillips Torq Hex… But only need one at a time… Caltech CS184 Winter2005 -- DeHon

Video Decoder E.g. Video decoder [frame rate = 33ms] if (packet==FRAME) if (type==I-FRAME) I-FRAME computation else if (type==B-FRAME) B-FRAME computation Caltech CS184 Winter2005 -- DeHon

Packet Processing If IP-V6 packet If IP-V4 packet If VoiP packet …. If IP-V4 packet … If VoiP packet If modem packet Caltech CS184 Winter2005 -- DeHon

Two Control Options Local control Instruction selection unify choices build all options into spatial compute structure and select operation Instruction selection provide a different instruction (instruction sequence) for each option selection occurs when chose which instruction(s) to issue Caltech CS184 Winter2005 -- DeHon

Example: ASCII HexBinary If (c>=0x30 && c<=0x39) res=c-0x30 // 0x30 = ‘0’ elseif (c>=0x41 && c<=0x46) res=c-0x41+10 // 0x41 = ‘A’ elseif (c>=0x61 && c<=0x66) res=c-0x61+10 // 0x61 = ‘a’ else res=0 Caltech CS184 Winter2005 -- DeHon

Local Control I<7:4> I<3:0> 01x0? P3*P1+P2*P0 0011? 16? <=9? P2*P0 C1*C0*I<3>+ C1*/C0 C1*(I<1>*C0+ /C0*(I<1> xor I<0>)) C0*I<2>+ /C0*(I<2> xor I<0>*I<1>) C1*(I<0>==C0) C1*r2a P3 P2 P1 P0 C1 C0 r2a R<3> R<2> R<1> R<0> Which case? One of three cases I<3:0> I<3:0>+0x1001 Caltech CS184 Winter2005 -- DeHon

Local Control LUTs used  LUT evaluations produced This example: each case only requires 4 4-LUTs to produce R<3:0> takes 5 4-LUTs once unified => Counting LUTs not tell cycle-by-cycle LUT needs Caltech CS184 Winter2005 -- DeHon

Instruction Control Static Sequence P3 P2 P1 P0 C1 C0 R3 r2a R1 R0 R2 01x0? P3*P1+P2*P0 0011? 16? <=9? P2*P0 C1*C0*I<3>+ C1*/C0 C1*(I<1>*C0+ /C0*(I<1> xor I<0>)) C0*I<2>+ /C0*(I<2> xor I<0>*I<1>) C1*(I<0>==C0) C1*r2a P3 P2 P1 P0 C1 C0 r2a R<3> R<2> R<1> R<0> Instruction Control 00 01 10 11 P3 P2 P1 P0 0 1 C0 1 C1 0 0 0 0 0 0 R3 R2 R1 R0 0 0 Caltech CS184 Winter2005 -- DeHon

Static/Control Static sequence Control 4 4-LUTs depth 4 4 context maybe 3 shuffle r2a w/ C1, C0 execute 0 1 1 2 0 1 1 2 Control 6 4-LUTs depth 3 4 contexts Example too simple to show big savings... Caltech CS184 Winter2005 -- DeHon

Local vs. Instruction If can decide early enough If load too expensive and afford schedule/reload instruction select  less computation If load too expensive local instruction faster maybe even less capacity (AT) Caltech CS184 Winter2005 -- DeHon

Slow Context Switch Instruction selection profitable only at coarse grain Xilinx ms reconfiguration times HSRA ms reconfiguration times still 1000s of cycles E.g. Video decoder [frame rate = 33ms] if (packet==FRAME) if (type==I-FRAME) IF-context else if (type==B-FRAME) BF-context Caltech CS184 Winter2005 -- DeHon

Local vs. Instruction For multicontext device For conventional devices i.e. fast (single) cycle switch factor according to available contexts For conventional devices factor only for gross differences and early binding time Caltech CS184 Winter2005 -- DeHon

Optimization Minimize Capacity Consumed: Basic Components ATlocal = Agen Tgen ATselect = Aselect (Tselect+Tload) Tload0 if can overlap w/ previous operation know early enough background load have sufficient bandwidth to load Basic Components Tload -- config. time Tselect -- case compute time Tgen -- generalized compute time Aselect -- case compute area Agen -- generalized compute area Caltech CS184 Winter2005 -- DeHon

FSM Control Factoring Experiment Caltech CS184 Winter2005 -- DeHon

FSM Example FSM -- canonical “control” structure captures many of these properties can implement with deep multicontext instruction selection can implement as multilevel logic unify, use local control Serve to build intuition Caltech CS184 Winter2005 -- DeHon

FSM Example (local control) 4 4-LUTs 2 LUT Delays Caltech CS184 Winter2005 -- DeHon

FSM Example (Instruction) 3 4-LUTs 1 LUT Delay Caltech CS184 Winter2005 -- DeHon

Full Partitioning Experiment Give each state its own context Optimize logic in state separately Tools mustang, espresso, sis, Chortle Use: one-hot encodings for single context smallest/fastest dense for multicontext assume context select needs dense Caltech CS184 Winter2005 -- DeHon

Look at Assume stay in context for a number of LUT delays to evaluate logic/next state Pick delay from worst-case Assume single LUT-delay for context selection? savings of 1 LUT-delay => comparable time Count LUTs in worst-case state Caltech CS184 Winter2005 -- DeHon

Full Partition (Area Target) Caltech CS184 Winter2005 -- DeHon

Full Partition (Delay Target) Caltech CS184 Winter2005 -- DeHon

Full Partitioning Full partitioning comes out better ~40% less area Note: full partition may not be optimal area case e.g. intro example, no reduction in area or time beyond 2-context implementation 4-context (full partition) just more area (additional contexts) Caltech CS184 Winter2005 -- DeHon

Partitioning versus Contexts (Area) CSE benchmark Caltech CS184 Winter2005 -- DeHon

Partitioning versus Contexts (Delay) Caltech CS184 Winter2005 -- DeHon

Partitioning versus Contexts (Heuristic) Start with dense mustang state encodings Greedily pick state bit which produces least greatest area split least greatest delay split Repeat until have desired number of contexts Caltech CS184 Winter2005 -- DeHon

Partition to Fixed Number of Contexts N.B. - more realistic, device has fixed number of contexts. Caltech CS184 Winter2005 -- DeHon

Extend Comparison to Memory Fully local => compute with LUTs Fully partitioned => lookup logic (context) in memory and compute logic How compare to fully memory? Simply lookup result in table? Caltech CS184 Winter2005 -- DeHon

Memory FSM Compare (small) Caltech CS184 Winter2005 -- DeHon

Memory FSM Compare (large) Caltech CS184 Winter2005 -- DeHon

Memory FSM Compare (notes) Memory selected was “optimally” sized to problem in practice, not get to pick memory allocation/organization for each FSM no interconnect charged Memory operate in single cycle but cycle slowing with inputs Smaller for <11 state+input bits Memory size not affected by CAD quality (FPGA/DPGA is) Caltech CS184 Winter2005 -- DeHon

Control Granularity Caltech CS184 Winter2005 -- DeHon

Control Granularity What if we want to run multiple of these FSMs on the same component? Local Instruction Caltech CS184 Winter2005 -- DeHon

Consider Two network data ports states: idle, first-datum, receiving, closing data arrival uncorrelated between ports Caltech CS184 Winter2005 -- DeHon

Local Control Multi-FSM Not rely on instructions Each wired up independently Easy to have multiple FSMs (units of control) Caltech CS184 Winter2005 -- DeHon

Instruction Control If FSMs advance orthogonally (really independent control) context depth => product of states for full partition I.e. w/ single controller (PC) must create product FSM which may lead to state explosion N FSMs, with S states => SN product states This example: 4 states, 2 FSMs => 16 state composite FSM Caltech CS184 Winter2005 -- DeHon

Architectural Questions How many pinsts/controller? Fixed or Configurable assignment of controllers to pinsts? …what level of granularity? Caltech CS184 Winter2005 -- DeHon

Architectural Questions Effects of: Too many controllers? Too few controllers? Fixed controller assignment? Configurable controller assignment? Caltech CS184 Winter2005 -- DeHon

Architectural Questions Too many: wasted space on extra controllers synchronization? Too few: product state space and/or underuse logic Fixed: underuse logic if when region too big Configurable: cost interconnect, slower distribution Caltech CS184 Winter2005 -- DeHon

Control and FPGAs Local/single instruction not rely on controller Potential strength of FPGA Easy to breakup capacity and deploy to orthogonal tasks How processor handle orthogonal tasks? Efficiency? Caltech CS184 Winter2005 -- DeHon

Control and FPGAs Data dependent selection potentially fast w/ local control compared to uP Can examine many bits and perform multi-way branch (output generation) in just a few LUT cycles mP requires sequence of operations Caltech CS184 Winter2005 -- DeHon

Architecture Instr. Taxonomy Caltech CS184 Winter2005 -- DeHon

Admin FEEDBACK sheets Final classes Monday/Wednesday Course EAS Monday: Specialization Wednesday: wrapup, Q&A… Caltech CS184 Winter2005 -- DeHon

Big Ideas [MSB Ideas] Control: where data effects instructions (operation) Two forms: local control all ops resident => fast selection instruction selection may allow us to reduce instantaneous work requirements introduce issues depth, granularity, instruction load time Caltech CS184 Winter2005 -- DeHon

Big Ideas [MSB-1 Ideas] Intuition => looked at canonical FSM case few context can reduce LUT requirements considerably (factor dissimilar logic) similar logic more efficient in local control overall, moderate contexts (e.g. 8) exploits both properties better than extremes single context (all local control) full partition flat memory (except for smallest FSMs) Caltech CS184 Winter2005 -- DeHon