CS294-6 Reconfigurable Computing
Day 25: Heterogeneous Systems and Interfacing
Previously
Homogeneous model of a computational array
–single word granularity, depth, interconnect
–all post-fabrication programmable
Understand the tradeoffs of each
Today
Heterogeneous architectures
–Why?
–How?
 catalog of techniques
 fit in framework
 optimization and mapping
Why?
Why would we be interested in a heterogeneous architecture?
Why?
Applications have a mix of characteristics
Already accepted:
–seldom can afford to build the most general (unstructured) array: bit-level, deep context, p=1
–=> we are already picking some structure to exploit
May be beneficial to have portions of the computation optimized for different structure conditions.
Examples
Processor + FPGA
Processor or FPGA with added:
–multiplier or MAC unit
–FPU
–motion-estimation coprocessor
Optimization Prospect
Less capacity (area × time) needed for the composite than for either pure architecture:
–(A1 + A2) · T12 < A1 · T1
–(A1 + A2) · T12 < A2 · T2
where A1, A2 are the areas of the two resource types, T1, T2 the runtimes on each pure architecture, and T12 the runtime on the composite.
Optimization Prospect Example
Floating point:
–Task: I integer ops + F FP adds
–Aproc = 125Mλ²
–AFPU = 40Mλ²
–FP add in software: 60 integer cycles
–processor alone: area · time = 125(I + 60F)
–processor + FPU: (125 + 40)(I + F) = 165(I + F)
–FPU wins when 125(I + 60F) > 165(I + F), i.e. when I/F < (7500 − 165)/40 ≈ 183 (worked numerically below)
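A quick numeric check of this break-even point, sketched in Python (the areas and the 60-cycle software FP add are the slide's numbers; function and variable names are just illustrative):

```python
# Area-time tradeoff for adding an FPU, using the slide's numbers.
A_PROC = 125       # processor area (M lambda^2)
A_FPU = 40         # FPU area (M lambda^2)
SW_FP_CYCLES = 60  # integer cycles to emulate one FP add

def area_time(i_ops, f_ops, with_fpu):
    """Area x cycles to run i_ops integer ops and f_ops FP adds."""
    if with_fpu:
        return (A_PROC + A_FPU) * (i_ops + f_ops)
    return A_PROC * (i_ops + SW_FP_CYCLES * f_ops)

# Break-even: 125(I + 60F) = 165(I + F)  =>  I/F = (7500 - 165)/40
break_even = (A_PROC * SW_FP_CYCLES - (A_PROC + A_FPU)) / A_FPU
print(break_even)  # 183.375

# At I/F = 100 the FPU-equipped composite has the smaller area-time product.
print(area_time(100, 1, True) < area_time(100, 1, False))  # True
```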
How?
Design issues:
–Interconnect: space and time
–Control
–Instructions: configuration path and control
–Mapping
Cost/benefits:
–Costs: area, power
–Performance: bandwidth, latency
Interconnect
–Bus (degenerate network)
–Memory (shared retiming resource)
–RF/Coproc (traditional processor interface)
–Network
Interconnect: Bus
Minimal physical network
–shared with memory and/or other peripherals
–10s–100s of cycles away from the processor (FPGA)
–low-to-moderate bandwidth
–can handle multiple, different functional units, but the serial bottleneck of the bus prevents simultaneous communication among devices
Interconnect: Bus Example
XC6200
Interconnect: Memory
Use a memory (retiming) block to buffer data between heterogeneous regions
–DMA (usually implies a shared bus)
–FIFO
–dual-port or shared RAM
Decoupled, moderate latency (10–100 cycles), moderate bandwidth (see the FIFO sketch below)
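As a sketch of this decoupling (the depth and data are made up, not from any of the example systems), a bounded FIFO lets the two regions synchronize only on full/empty rather than cycle by cycle:

```python
from collections import deque

class BoundedFIFO:
    """Memory block buffering data between two heterogeneous regions."""
    def __init__(self, depth):
        self.depth = depth
        self.buf = deque()

    def push(self, word):
        """Producer side: returns False (stall) when the buffer is full."""
        if len(self.buf) >= self.depth:
            return False
        self.buf.append(word)
        return True

    def pop(self):
        """Consumer side: returns None (stall) when the buffer is empty."""
        return self.buf.popleft() if self.buf else None

# Processor fills the FIFO; the array drains it at its own rate.
fifo = BoundedFIFO(depth=16)
for i in range(8):
    fifo.push(i)
while (word := fifo.pop()) is not None:
    print(word)
```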
Interconnect: Memory Examples
PAM, SPLASH
Interconnect: RF/Coproc
Coupled directly to the processor datapath
–low latency (1–2 cycles)
–moderately high bandwidth, limited by RF ports and control
Interconnect: RF/Coproc Examples
GARP, Chimaera
–(more on this case Thursday)
Interconnect: Network
Unified spatial network composing the various heterogeneous components
–high bandwidth
–latency varies with distance
–supports simultaneous operation and data transfer
–potentially the dominant cost: Ainterconnect > Afunction
–granularity question:
 coarse (large blocks of each type)
 fine (interleaved)
Interconnect: Network, Coarse Examples
Cheops, Pleiades
HSRA Heterogeneous Blocks
Interconnect: Network, Coarse vs. Fine
Multiplier/FPGA example
Interconnect: Network, Coarse vs. Fine
Fine:
–possibly share interconnect
–locality
–uniform tiling
–if not shared, may get concentrations of heavy/no use
–interconnect limits use as independent resources
–ratio less flexible?
–more difficult design
Coarse:
–flexible ratio
–easier to keep dense homogeneous blocks
–requires its own interconnect
–doesn't disrupt base layout(s)
–non-local route to/from => more/longer wires
–boundaries in net
Admin
For POWER, update on:
–rcore simulation
–HSRA energy
–Jsim size problems? Fix in the works
Control
As before:
–How many controllers?
–How many pinsts slaved off of each?
Common classes:
–single controller / lock-step
–decoupled, data stream
–autonomous (MIMD)
Control: Lockstep
Master controller (usually the processor)
–issues an instruction (instruction tag) every cycle: explicitly says when the device should operate
–single thread of control: everything known to be in sync
–idle while the processor is doing other tasks
–Ex.: VLIW (TriMedia), PRISC, GARP
Control: Data Stream
Configure, then run on data, decoupled from the control processor
–runs in parallel with the processor
 processor runs orthogonal tasks
 maybe several simultaneous tasks running on the spatial fabric
–unit not typically fed by the processor directly
–need to synchronize data transfer and operation: polling, interrupt, semaphore (see the sketch below)
–Ex.: Cheops, PADDI-2, Pleiades, SPLASH
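A minimal sketch of the configure-then-run pattern with polling, using a Python thread to stand in for the decoupled unit (the class and its computation are hypothetical, not any of the listed systems' APIs):

```python
import threading
import time

class StreamUnit:
    """Stand-in for a decoupled unit: configured once, then runs on a
    data stream in parallel while the processor does orthogonal work."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None

    def configure_and_start(self, data):
        def run():
            self.result = sum(x * x for x in data)  # stand-in computation
            self.done.set()                         # completion flag
        threading.Thread(target=run).start()

unit = StreamUnit()
unit.configure_and_start(range(1000))

while not unit.done.is_set():  # polling-style synchronization
    time.sleep(0.001)          # processor would run orthogonal tasks here
print(unit.result)
```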
Control: Autonomous (MIMD)
Multiple (potential) control processors
–not necessarily slaved
–distributed control
–more care needed in synchronization
–Ex.: Floe (MagicEight)
HSRA Multi-Hetero
Coupling:
–unifying networks
Balance:
–sequential/spatial
–control units w/ management task
Configuration
Shared interface:
–config and data share the bus: XC6200, PAM, SPLASH
–config shares the memory path: GARP
Separate path / network:
–VLIW, Pleiades
Explicit:
–XC6200, PAM, SPLASH, ...
Implicit:
–GARP/PRISC
Mapping
–often a choice of where an operation runs
–must sort out what goes where
 faster in one resource?
 ...but limited number of each resource
Mapping: Limited Resource
What runs on the faster, limited resource?
–E.g. Tim's C extraction last time
–general: what is allocated to the resource when we reconfigure
 N candidate ops -> each a choice
–greedy (sketched below):
 break into temporal regions
 –local working set and points of reconfiguration
 while resource remains available
 –add the op offering the most benefit
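A minimal sketch of that greedy policy, assuming each candidate op carries a benefit (e.g. cycles saved) and a cost in units of the limited resource (the op names and numbers are made up):

```python
def greedy_allocate(regions, capacity):
    """Per temporal region, greedily fill the limited resource.

    regions: one list of (op, benefit, cost) candidates per temporal
             region (the working set between reconfiguration points).
    capacity: amount of the fast resource available per configuration.
    """
    schedule = []
    for candidates in regions:
        free, chosen = capacity, []
        # While resource remains, take the highest-benefit op that fits.
        for op, benefit, cost in sorted(candidates, key=lambda c: -c[1]):
            if cost <= free:
                chosen.append(op)
                free -= cost
        schedule.append(chosen)
    return schedule

# Two temporal regions, 8 units of the fast resource in each.
regions = [
    [("fir", 10, 5), ("mac", 7, 4), ("add", 2, 1)],
    [("fft", 12, 8), ("cmp", 3, 2)],
]
print(greedy_allocate(regions, capacity=8))  # [['fir', 'add'], ['fft']]
```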
Mapping: Spatial Choice
Different kinds of resources
–(e.g. LUTs, multipliers)
Multiple resources can solve the same problem
Limited number of each resource
=> match users with resources
Mapping: Bipartite
Partitioning => bipartite matching (sketched below)
–deals with unit resource consumption
–also with regional/interconnect constraints
–does not directly deal with performance...
 postpass(?) allocate faster resources to the critical path
–? N of R1 vs. M of R2
Example/details: Liu FPGA98
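To make the matching step concrete, here is a small augmenting-path bipartite matcher pairing ops with resource instances (a generic textbook algorithm under made-up names; see Liu FPGA98 for the formulation the slide actually cites):

```python
def bipartite_match(compatible):
    """Match each op to at most one resource instance.

    compatible: dict mapping op -> resource instances that can host it.
    Returns resource -> op for the maximum matching found.
    """
    owner = {}  # resource instance -> op currently seated there

    def try_assign(op, visited):
        for res in compatible[op]:
            if res in visited:
                continue
            visited.add(res)
            # Take a free resource, or evict and re-seat its current owner.
            if res not in owner or try_assign(owner[res], visited):
                owner[res] = op
                return True
        return False

    for op in compatible:
        try_assign(op, set())
    return owner

# Two LUT sites, one multiplier block; mul0 can map to either kind.
compatible = {
    "add0": ["lut0", "lut1"],
    "mul0": ["lut0", "mult0"],
    "mul1": ["mult0"],
}
print(bipartite_match(compatible))
# {'lut0': 'mul0', 'lut1': 'add0', 'mult0': 'mul1'} -- all three ops seated
```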
Mapping
More common:
–can solve with: 12 A's and 2 B's, or 4 A's and 4 B's
–common need: 4 A's and 2 B's
–so the real choice is 8 A's vs. 2 B's
Highlights
Fits into the existing framework
–not that much new here
–new issue: who shares resources, and how
Issues: interconnect, control
+ density when the balance is hit
- efficiency when the balance is mismatched
- harder mapping (resource sharing)