ESE534: Computer Organization

Slides:



Advertisements
Similar presentations
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 14: March 19, 2014 Compute 2: Cascades, ALUs, PLAs.
Advertisements

1/1/ /e/e eindhoven university of technology Microprocessor Design Course 5Z008 Dr.ir. A.C. (Ad) Verschueren Eindhoven University of Technology Section.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.
CS294-6 Reconfigurable Computing Day 6 September 10, 1998 Comparing Computing Devices.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day8: October 18, 2000 Computing Elements 1: LUTs.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 9: February 7, 2007 Instruction Space Modeling.
Penn ESE Fall DeHon 1 ESE (ESE534): Computer Organization Day 19: March 26, 2007 Retime 1: Transformations.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 26: April 18, 2007 Et Cetera…
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 11: February 14, 2007 Compute 1: LUTs.
CS294-6 Reconfigurable Computing Day 18 October 22, 1998 Control.
CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 23: April 9, 2007 Control.
Penn ESE Spring DeHon 1 FUTURE Timing seemed good However, only student to give feedback marked confusing (2 of 5 on clarity) and too fast.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 5: January 24, 2007 ALUs, Virtualization…
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: March 3, 2014 Instruction Space Modeling 1.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day3:
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 9: February 24, 2014 Operator Sharing, Virtualization, Programmable Architectures.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 7: January 24, 2003 Instruction Space.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day18: November 22, 2000 Control.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 20: March 3, 2003 Control.
EKT303/4 Superscalar vs Super-pipelined.
Penn ESE534 Spring DeHon 1 ESE534 Computer Organization Day 9: February 13, 2012 Interconnect Introduction.
Caltech CS184 Winter DeHon CS184a: Computer Architecture (Structure and Organization) Day 4: January 15, 2003 Memories, ALUs, Virtualization.
Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.
Caltech CS184 Winter DeHon 1 CS184a: Computer Architecture (Structure and Organization) Day 11: January 31, 2005 Compute 1: LUTs.
Penn ESE535 Spring DeHon 1 ESE535: Electronic Design Automation Day 15: March 13, 2013 High Level Synthesis II Dataflow Graph Sharing.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 11: February 20, 2012 Instruction Space Modeling.
Penn ESE534 Spring DeHon 1 ESE534: Computer Organization Day 22: April 16, 2014 Time Multiplexing.
Design of Digital Circuits Lecture 14: Microprogramming
ESE532: System-on-a-Chip Architecture
Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Parallel Hardware Dr. Xiao Qin Auburn.
ESE534: Computer Organization
William Stallings Computer Organization and Architecture 8th Edition
ESE532: System-on-a-Chip Architecture
CS184a: Computer Architecture (Structure and Organization)
Introduction Introduction to VHDL Entities Signals Data & Scalar Types
ESE534 Computer Organization
ESE532: System-on-a-Chip Architecture
ESE532: System-on-a-Chip Architecture
ESE532: System-on-a-Chip Architecture
ESE532: System-on-a-Chip Architecture
ESE532: System-on-a-Chip Architecture
CS184a: Computer Architecture (Structure and Organization)
Hyperthreading Technology
ESE534: Computer Organization
ESE535: Electronic Design Automation
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
ESE534: Computer Organization
CSE 370 – Winter 2002 – Comb. Logic building blocks - 1
ESE534: Computer Organization
CS184a: Computer Architecture (Structure and Organization)
ESE532: System-on-a-Chip Architecture
ESE534: Computer Organization
ESE534: Computer Organization
ESE535: Electronic Design Automation
ESE534: Computer Organization
ESE535: Electronic Design Automation
ESE534: Computer Organization
ESE535: Electronic Design Automation
ESE534: Computer Organization
ESE534: Computer Organization
ESE532: System-on-a-Chip Architecture
ESE532: System-on-a-Chip Architecture
Presentation transcript:

ESE534: Computer Organization Day 23: April 19, 2010 Control Penn ESE534 Spring2010 -- DeHon

Previously Looked broadly at instruction effects Explored structural components of computation Interconnect, compute, retiming Explored operator sharing/time-multiplexing Explored branching for code compactness Penn ESE534 Spring2010 -- DeHon

Today Control Different forms data-dependent operations Different forms local instruction selection Control granularity as another parameter in our architecture space Penn ESE534 Spring2010 -- DeHon

Control Control: That point where the data affects the instruction stream (operation selection) Typical manifestation data dependent branching if (a!=0) OpA else OpB bne data dependent state transitions new => goto S0 else => stay data dependent operation selection +/- addp Penn ESE534 Spring2010 -- DeHon

Control Viewpoint: can have instruction stream sequence without control I.e. static/data-independent progression through sequence of instructions is control free C0C1C2C0C1C2C0… Similarly, FSM w/ no data inputs E.g. HW3…non-branching multiplier Penn ESE534 Spring2010 -- DeHon

Programmable Architecture Day 6 Programmable Architecture Penn ESE534 Spring2010 -- DeHon

Terminology (reminder) Primitive Instruction (pinst) Collection of bits which tell a bit-processing element what to do Includes: select compute operation input sources in space (interconnect) input sources in time (retiming) Configuration Context Collection of all bits (pinsts) which describe machine’s behavior on one cycle Penn ESE534 Spring2010 -- DeHon

Back to “Any” Computation Design must handle all potential inputs (computing scenarios) Requires sufficient generality However, computation for any given input may be much smaller than general case. Instantaneous compute << potential compute Penn ESE534 Spring2010 -- DeHon

Preclass 1 if ((dx*dx+dy*dy)>threshold) z=cx*dx+cy*dy else z=log(dx*dy)+c3 How many operations performed? Cycles? Compute Blocks needed? Penn ESE534 Spring2010 -- DeHon

Preclass 1 Delay? Operations? How do? Penn ESE534 Spring2010 -- DeHon

If-Conversion Penn ESE534 Spring2010 -- DeHon

If-Conversion Trade-off: Latency Work Penn ESE534 Spring2010 -- DeHon

Preclass Operations Cycles Penn ESE534 Spring2010 -- DeHon

Day 3 Idea Compute both possible values and select correct result when we know the answer Penn ESE534 Spring2010 -- DeHon

If-Conversion ~= Predicated Operations P1=P() P2=!P1 P1 G() P2 H() P1=t1>t2 P2=!P1 … P1 z=t4 P2 z=t5 Why important? Penn ESE534 Spring2010 -- DeHon

Instruction Control Latency For time-multiplexed (data-independent) sequencing Can pipeline instruction distribution Instruction memory read Now decisionPCdistributionread latency becomes part of critical path Penn ESE534 Spring2010 -- DeHon

Screwdriver Analogy Need capability to handle Slothead Phillips Torq Hex… But only need one at a time… Penn ESE534 Spring2010 -- DeHon

Video Decoder E.g. Video decoder [frame rate = 33ms] if (packet==FRAME) if (type==I-FRAME) I-FRAME computation else if (type==B-FRAME) B-FRAME computation Penn ESE534 Spring2010 -- DeHon

Packet Processing If IP-V6 packet If IP-V4 packet If VoiP packet …. If IP-V4 packet … If VoiP packet If modem packet Penn ESE534 Spring2010 -- DeHon

Two Control Options Local control Instruction selection unify choices build all options into spatial compute structure and select operation  Mux-conversion Instruction selection provide a different instruction (instruction sequence) for each option selection occurs when chose which instruction(s) to issue Penn ESE534 Spring2010 -- DeHon

Two Control Options Local control Instruction selection May use both within an application Local control in critical path, inner-loops, where latency rather than parallelism limited Instruction-selection coarse-grain selection, where have plenty of task parallelism so latency not limit computation Penn ESE534 Spring2010 -- DeHon

FSM Example (local control) 4 4-LUTs 2 LUT Delays Penn ESE534 Spring2010 -- DeHon

FSM Example 3 4-LUTs 1 LUT Delay Penn ESE534 Spring2010 -- DeHon

Local Control LUTs used  LUT evaluations produced  Counting LUTs not tell cycle-by-cycle LUT needs Penn ESE534 Spring2010 -- DeHon

FSM Example (Instruction) 3 4-LUTs 1 LUT Delay Penn ESE534 Spring2010 -- DeHon

Local vs. Instruction If can decide early enough If load too expensive and afford schedule/reload instruction select  less computation If load too expensive local instruction faster maybe even less capacity (AT) Penn ESE534 Spring2010 -- DeHon

Slow Context Switch When context selection is slow, Instruction selection profitable only at coarse grain Xilinx ms reconfiguration times E.g. Video decoder [frame rate = 33ms] if (packet==FRAME) if (type==I-FRAME) IF-context else if (type==B-FRAME) BF-context Penn ESE534 Spring2010 -- DeHon

Local vs. Instruction For multicontext device For conventional FPGAs i.e. fast (single) cycle switch factor according to available contexts For conventional FPGAs factor only for gross differences and early binding time Penn ESE534 Spring2010 -- DeHon

FSM Control Factoring Experiment Penn ESE534 Spring2010 -- DeHon

FSM Example FSM -- canonical “control” structure captures many of these properties can implement with deep multicontext instruction selection can implement as multilevel logic unify, use local control Serve to build intuition Penn ESE534 Spring2010 -- DeHon

Full Partitioning Experiment Give each state its own context Optimize logic in state separately Tools mustang, espresso, sis, Chortle Use: one-hot encodings for single context smallest/fastest dense for multicontext assume context select needs dense Penn ESE534 Spring2010 -- DeHon

Comparison Assume stay in context for a number of LUT delays to evaluate logic/next state Pick delay from worst-case Assume single LUT-delay for context selection? savings of 1 LUT-delay => comparable time Count LUTs in worst-case state Penn ESE534 Spring2010 -- DeHon

Full Partition (Area Target) Penn ESE534 Spring2010 -- DeHon

Full Partition (Delay Target) Penn ESE534 Spring2010 -- DeHon

Full Partitioning Full partitioning comes out better ~40% less area Note: full partition may not be optimal area case e.g. intro example, no reduction in area or time beyond 2-context implementation 4-context (full partition) just more area (additional contexts) Penn ESE534 Spring2010 -- DeHon

Partitioning versus Contexts (Area) CSE benchmark Penn ESE534 Spring2010 -- DeHon

Partitioning versus Contexts (Delay) CSE benchmark Penn ESE534 Spring2010 -- DeHon

Partitioning versus Contexts (Heuristic) Start with dense mustang state encodings Greedily pick state bit which produces least greatest area split least greatest delay split Repeat until have desired number of contexts Penn ESE534 Spring2010 -- DeHon

Partition to Fixed Number of Contexts N.B. - more realistic, device has fixed number of contexts. Penn ESE534 Spring2010 -- DeHon

Extend Comparison to Memory Fully local => compute with LUTs Fully partitioned => lookup logic (context) in memory and compute logic How compare to fully memory? Simply lookup result in table? Penn ESE534 Spring2010 -- DeHon

Memory FSM Compare (small) Penn ESE534 Spring2010 -- DeHon

Memory FSM Compare (large) Penn ESE534 Spring2010 -- DeHon

Memory FSM Compare (notes) Memory selected was “optimally” sized to problem in practice, not get to pick memory allocation/organization for each FSM no interconnect charged Memory operate in single cycle but cycle slowing with inputs Smaller for <11 state+input bits Memory size not affected by CAD quality (FPGA/DPGA is) Penn ESE534 Spring2010 -- DeHon

Architectural Parameter(s) Instruction Selection Control Granularity Architectural Parameter(s) Instruction Selection Penn ESE534 Spring2010 -- DeHon

Preclass WAIT: if (in.type==header) cnt=in.header_payload_size; checksum=0; goto RECEIVE; else goto WAIT; RECEIVE: checksum=checksum xor in; data[packet][cnt]=in; cnt--; if (cnt==0) goto CHECK; else goto RECEIVE; CHECK: if (in==checksum) packet++; goto WAIT Penn ESE534 Spring2010 -- DeHon

Preclass How support two ports on VLIW with single controller? WAIT: if (in.type==header) cnt=in.header_payload_size; checksum=0; goto RECEIVE; else goto WAIT; RECEIVE: checksum=checksum xor in; data[packet][cnt]=in; cnt--; if (cnt==0) goto CHECK; else goto RECEIVE; CHECK: if (in==checksum) packet++; goto WAIT How support two ports on VLIW with single controller? Penn ESE534 Spring2010 -- DeHon

Instruction Control If FSMs advance orthogonally (really independent control) context depth => product of states for full partition I.e. w/ single controller (PC) must create product FSM which may lead to state explosion N FSMs, with S states => SN product states Penn ESE534 Spring2010 -- DeHon

Architectural Differences Day 10 What differentiates a VLIW from a multicore? Penn ESE534 Spring2010 -- DeHon

Preclass How support on two VLIWs with separate controllers? WAIT: if (in.type==header) cnt=in.header_payload_size; checksum=0; goto RECEIVE; else goto WAIT; RECEIVE: checksum=checksum xor in; data[packet][cnt]=in; cnt--; if (cnt==0) goto CHECK; else goto RECEIVE; CHECK: if (in==checksum) packet++; goto WAIT How support on two VLIWs with separate controllers? Penn ESE534 Spring2010 -- DeHon

Architectural Questions How many pinsts/controller? Penn ESE534 Spring2010 -- DeHon

Architecture Taxonomy Day 12 Architecture Taxonomy PCs Pints/PC depth width Architecture 1 FPGA 1024 32 Scalar Processor (RISC) N D W VLIW (superscalar) Small W*N SIMD, GPU, Vector MIMD 4 2048 64 Quad core Penn ESE534 Spring 2010 -- DeHon

Architectural Questions How many pinsts/controller? Fixed or Configurable assignment of controllers to pinsts? …what level of granularity? Penn ESE534 Spring2010 -- DeHon

Architectural Questions Effects of: Too many controllers? Too few controllers? Fixed controller assignment? Configurable controller assignment? Penn ESE534 Spring2010 -- DeHon

Architectural Questions Too many: wasted space on extra controllers synchronization? Too few: product state space and/or underuse logic Fixed: underuse logic if when region too big Configurable: cost interconnect, slower distribution Penn ESE534 Spring2010 -- DeHon

Admin Final Exercise Remember online course feedback update with sparing-based design discussion period end 4/26 Remember online course feedback Reading for Wednesday online Penn ESE534 Spring2010 -- DeHon

Big Ideas [MSB Ideas] Control: where data effects instructions (operation) Two forms: local control all ops resident  fast selection instruction selection may allow us to reduce instantaneous work requirements introduce issues depth, granularity, instruction load and select time Penn ESE534 Spring2010 -- DeHon

Big Ideas [MSB-1 Ideas] If-Conversion Latency vs. work tradeoff Intuition  explored canonical FSM case few context can reduce LUT requirements considerably (factor dissimilar logic) similar logic more efficient in local control overall, moderate contexts (e.g. 8) exploits both properties … better than extremes Penn ESE534 Spring2010 -- DeHon