CS294-6 Reconfigurable Computing Day 25 Heterogeneous Systems and Interfacing
Previously: homogeneous model of a computational array – single word granularity, context depth, interconnect – all post-fabrication programmable – understood the tradeoffs of each
Today: heterogeneous architectures – Why? – How? a catalog of techniques, fit into the framework, optimization and mapping
Why? Why would we be interested in a heterogeneous architecture? – e.g. the examples on the next slides
Why? Applications have a mix of characteristics. Already accepted: – we can seldom afford to build the most general (unstructured) array (bit-level, deep context, p=1) – => we are already picking some structure to exploit. It may be beneficial to have portions of the computation optimized for different structure conditions.
Examples: Processor + FPGA; processor or FPGA plus – multiplier or MAC unit – FPU – motion-estimation coprocessor
Optimization Prospect: the composite needs less area-time than either pure solution – $(A_1+A_2)\,T_{12} < A_1 T_1$ – $(A_1+A_2)\,T_{12} < A_2 T_2$
Optimization Prospect Example: Floating Point – Task: $I$ integer ops + $F$ FP adds – $A_{proc} = 125\,\mathrm{M}\lambda^2$, $A_{FPU} = 40\,\mathrm{M}\lambda^2$ – 60 processor cycles per FP op in software – composite wins when $125(I+60F) > 165(I+F)$, i.e. when $I/F < 7335/40 \approx 183$
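A worked version of that break-even arithmetic, assuming (as the slide's numbers imply) that the FPU retires an FP add in a single cycle and that the composite area is simply $125 + 40 = 165\,\mathrm{M}\lambda^2$:

```latex
% Composite (165 M\lambda^2, 1-cycle FP add) beats the pure processor
% (125 M\lambda^2, 60 cycles/FP op) when its area-time product is smaller:
\[
  125\,(I + 60F) > 165\,(I + F)
  \;\Longleftrightarrow\;
  (125\cdot 60 - 165)\,F > (165 - 125)\,I
  \;\Longleftrightarrow\;
  \frac{I}{F} < \frac{7335}{40} \approx 183 .
\]
```

So adding the FPU pays off in area-time whenever there are fewer than roughly 183 integer ops per floating-point op.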
How? Design issues: – interconnect (space and time) – control – instructions (configuration path and control) – mapping. Costs/benefits: – costs: area, power – performance: bandwidth, latency
Interconnect options: bus (degenerate network), memory (shared retiming resource), RF/coprocessor (traditional processor interface), network
Interconnect: Bus. Minimal physical network – shared with memory and/or other peripherals – 10s-100s of cycles away from the processor (e.g., an FPGA on the I/O bus) – low-to-moderate bandwidth – can handle multiple, different functional units, but the serial bottleneck of the bus prevents simultaneous communication among devices
Interconnect: Bus Example XC6200
Interconnect: Memory. Use a memory (retiming) block to buffer data between heterogeneous regions – DMA (usually implies a shared bus) – FIFO – dual-port or shared RAM; decoupled, moderate latency (10-100 cycles), moderate bandwidth (see the FIFO sketch below)
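A minimal Python sketch of the decoupling idea: a bounded FIFO lets the two sides proceed at their own rates with no cycle-level coordination (the names and the toy computation are hypothetical stand-ins):

```python
from collections import deque

class FIFO:
    """Bounded FIFO modeling a shared retiming buffer between a
    processor and a reconfigurable array."""
    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    def full(self):
        return len(self.q) >= self.depth

    def empty(self):
        return len(self.q) == 0

    def push(self, word):
        assert not self.full()
        self.q.append(word)

    def pop(self):
        assert not self.empty()
        return self.q.popleft()

# Producer (processor) and consumer (array) advance independently;
# the FIFO absorbs rate differences between them.
fifo = FIFO(depth=16)
data = list(range(100))
results = []
while data or not fifo.empty():
    if data and not fifo.full():        # processor side: write when space
        fifo.push(data.pop(0))
    if not fifo.empty():                # array side: compute when data ready
        results.append(fifo.pop() * 2)  # stand-in for the array's operation
assert results == [2 * i for i in range(100)]
```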
Interconnect: Memory Example PAM, SPLASH
Interconnect: RF/Coproc. Coupled directly to the processor datapath – low latency (1-2 cycles) – moderately high bandwidth, limited by register-file ports and control
Interconnect: RF/Coproc Examples GARP, Chimaera –(more on this case Thursday)
Interconnect: Network. A unified spatial network composing the various heterogeneous components – high bandwidth – latency varies with distance – supports simultaneous operation and data transfer – potentially the dominant cost: $A_{interconnect} > A_{function}$ – granularity question: coarse (large blocks of each type) vs. fine (interleaved)
Interconnect: Network Coarse Cheops, Pleiades
HSRA Heterogeneous Blocks
Interconnect: Network Coarse vs. Fine Multiplier/FPGA example
Interconnect: Network, Coarse vs. Fine
Fine:
– possibly share interconnect: locality, uniform tiling
– if not shared evenly, may get concentrations of heavy-use/no-use interconnect
– sharing limits use as independent resources
– ratio less flexible?
– more difficult design
Coarse:
– flexible ratio
– easier to keep dense homogeneous blocks
– requires its own interconnect
– doesn't disrupt the base layout(s)
– non-local routes to/from => more/longer wires
– boundaries in the network
Admin: for POWER, update on – rcore simulation – HSRA energy – Jsim size problems? fix in the works
Control. As before: – how many controllers? – how many pinsts slaved off each? Common classes: – single controller / lockstep – decoupled, data stream – autonomous MIMD
Control: Lockstep. Master controller (usually the processor) – issues an instruction (or instruction tag) every cycle, stating explicitly when the device should operate – single thread of control: everything is known to be in sync – device idles while the processor does other tasks – Ex. VLIW (TriMedia), PRISC, GARP
Control: Data Stream. Configure, then run on data, decoupled from the control processor – runs in parallel with the processor: the processor runs orthogonal tasks; maybe several simultaneous tasks run on the spatial fabric – the unit is not typically fed by the processor directly – need to synchronize data transfer and operation: polling, interrupt, semaphore (see the sketch below) – Ex. Cheops, PADDI-2, Pleiades, SPLASH
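A minimal sketch of that decoupled, configure-then-run style in Python, using a thread as the stand-in for the spatial unit and an event flag for polling-style synchronization (the API names are hypothetical):

```python
import threading, time

class StreamUnit:
    """Toy model of a configure-then-run accelerator."""
    def __init__(self):
        self.op = None
        self.done = threading.Event()   # semaphore-style completion flag
        self.out = []

    def configure(self, op):
        self.op = op                    # one-time configuration, not per-cycle

    def start(self, block):
        """Kick off a run on a block of data and return immediately."""
        self.done.clear()
        def work():
            self.out = [self.op(x) for x in block]
            self.done.set()
        threading.Thread(target=work).start()

unit = StreamUnit()
unit.configure(lambda x: x * x)
unit.start(range(8))
while not unit.done.is_set():           # polling; could also block on done.wait()
    time.sleep(0)                       # processor is free for orthogonal tasks
print(unit.out)
```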
Control: Autonomous (MIMD). Multiple (potential) control processors – not necessarily slaved – distributed control – requires more care in synchronization – Ex. Floe (MagicEight)
HSRA Multi-Hetero: coupling – unifying networks; balance – sequential/spatial – control units with management tasks
Configuration. Shared interface: – config and data share the bus (XC6200, PAM, SPLASH) – config shares the memory path (GARP). Separate path/network: – VLIW, Pleiades. Explicit: – XC6200, PAM, SPLASH, … Implicit: – GARP/PRISC
Mapping – often have an option on where a computation runs – must sort out what goes where: faster on one resource? …but only a limited number of each resource
Mapping: Limited Resource. What runs on the faster, limited resource? – e.g. Tim's C extraction from last time – General: what gets allocated to the resource, and when do we reconfigure? N candidate ops -> each a choice – Greedy: break into temporal regions (local working set and points of reconfiguration); while the resource still has room, add the op offering the most benefit (see the sketch below)
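A minimal sketch of the greedy choice in Python (the regions, areas, and benefits are hypothetical stand-ins; "benefit" would be, e.g., cycles saved by moving the op onto the fast resource):

```python
# Greedy allocation per temporal region: within each region between
# reconfigurations, repeatedly take the candidate op with the largest
# benefit until the resource's capacity is exhausted.

def greedy_allocate(regions, capacity):
    """regions: list of lists of (op, area, benefit) candidates,
    one list per temporal region between reconfigurations."""
    allocation = []
    for candidates in regions:
        free = capacity
        chosen = []
        # consider ops in order of decreasing benefit
        for op, area, benefit in sorted(candidates, key=lambda c: -c[2]):
            if area <= free:          # add op while resource still available
                chosen.append(op)
                free -= area
        allocation.append(chosen)
    return allocation

regions = [[("fir", 3, 50), ("crc", 1, 10), ("mul", 2, 30)],
           [("fft", 4, 80), ("crc", 1, 10)]]
print(greedy_allocate(regions, capacity=4))
# -> [['fir', 'crc'], ['fft']]
```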
Mapping: Spatial Choice – different kinds of resources (e.g. LUTs, multipliers) – multiple resources can solve the same problem – limited number of each resource – match users with resources
Mapping: Bipartite Partitioning => bipartite matching – deals with unit resource consumption – also with regional/interconnect constraints – does not directly deal with performance… post-pass(?) allocation: faster resources to the critical path? N of R1 vs. M of R2. Example/details: Liu, FPGA '98 (see the sketch below)
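A minimal sketch of the matching step in Python, using the classic augmenting-path algorithm. It handles only unit resource consumption, not the regional/interconnect constraints or the performance post-pass, and is not Liu's exact formulation (names are hypothetical):

```python
# Match ops to resource instances: each op lists the instances that can
# implement it; each instance is consumed by at most one op.

def bipartite_match(compatible):
    """compatible: dict op -> list of resource instances it can use.
    Returns dict resource -> op giving a maximum matching."""
    owner = {}                      # resource instance -> op holding it

    def try_assign(op, visited):
        for r in compatible[op]:
            if r in visited:
                continue
            visited.add(r)
            # take r if free, or if its current owner can move elsewhere
            if r not in owner or try_assign(owner[r], visited):
                owner[r] = op
                return True
        return False

    for op in compatible:
        try_assign(op, set())
    return owner

# Two multiplies can go to the hard multiplier or be built from LUT clusters:
compatible = {"mul1": ["MULT0", "LUTS0"],
              "mul2": ["MULT0"],
              "add1": ["LUTS0", "LUTS1"]}
print(bipartite_match(compatible))
# -> {'MULT0': 'mul2', 'LUTS0': 'mul1', 'LUTS1': 'add1'}
```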
Mapping. The more common case: – a task can be solved with either 12 A's and 2 B's, or with 4 A's and 4 B's – the common need is 4 A's and 2 B's – so the real choice is between the increments: 8 A's vs. 2 B's (see the sketch below)
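A minimal sketch of that variant-selection view in Python, using the slide's 12-A/2-B vs. 4-A/4-B task plus a made-up second task and budget:

```python
# Each task offers implementation variants with different resource mixes;
# pick one variant per task so the totals fit the resource budget.
from itertools import product

def fits(choice, budget):
    totals = {}
    for variant in choice:
        for res, n in variant.items():
            totals[res] = totals.get(res, 0) + n
    return all(totals.get(r, 0) <= budget[r] for r in budget)

tasks = [
    [{"A": 12, "B": 2}, {"A": 4, "B": 4}],   # the slide's example task
    [{"A": 2, "B": 0}, {"A": 0, "B": 1}],    # a second, made-up task
]
budget = {"A": 6, "B": 4}

# Exhaustive search over variant choices (fine for small task counts;
# larger instances would want matching or ILP instead):
feasible = [c for c in product(*tasks) if fits(c, budget)]
print(feasible)
# -> [({'A': 4, 'B': 4}, {'A': 2, 'B': 0})]
```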
Highlights. Fits into the existing framework – not that much new here – new issue: who shares resources, and how. Issues: interconnect, control. + density when we hit the balance point; - efficiency when the balance is mismatched; - harder mapping (resource sharing)