1 Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling
Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins
Presented by: Nikhil Bansal

2 Outline
Introduction
 coarse-grained reconfigurable architectures
 core problem: exploiting parallelism
 modulo scheduling problem
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work

3 Example of Coarse-Grained Architectures: MorphoSys
[Figures: topology of MorphoSys and the architecture of a reconfigurable cell (Ming-Hau Lee et al., University of California, Irvine)]
Other examples: REMARC, PACT, Chameleon, KressArray, QuickSilver...

4 Core Problem: Exploiting Parallelism
Which parallelism makes the difference?
Instruction-level parallelism
 limited parallelism (constrained by dependences)
 VLIW already does a good job
Task (thread)-level parallelism
 hard to automate
 lacks support in coarse-grained architectures
Loop-level parallelism (pipelining)
 fits coarse-grained architectures well
 higher parallelism than ILP

5 Pipelining Using Modulo Scheduling
Modulo scheduling (in general):
 a way of pipelining loops
 iterations are overlapped
 each iteration is initiated at a fixed initiation interval (II)
For coarse-grained architectures:
 Where to place an operation? (placement)
 When to schedule an operation? (scheduling)
 How to connect operations? (routing)
 Modulo constraints
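As a small illustration of the modulo constraint mentioned above (not taken from the slides): iteration i of an operation scheduled at time t starts at t + i*II, so two operations bound to the same resource collide in the steady state exactly when their schedule times are congruent modulo II. A minimal Python sketch with hypothetical names:

```python
# Minimal sketch of the modulo constraint; names are illustrative, not DRESC code.

def start_time(t_op: int, iteration: int, II: int) -> int:
    """Iteration `iteration` of an operation scheduled at t_op begins II cycles later each time."""
    return t_op + iteration * II

def conflicts(t_a: int, t_b: int, II: int) -> bool:
    """Two operations bound to the same resource collide in the steady state
    iff their schedule times are congruent modulo II."""
    return t_a % II == t_b % II

# With II = 2, operations at t = 1 and t = 3 would fight over the same FU;
# operations at t = 1 and t = 2 would not.
assert conflicts(1, 3, II=2)
assert not conflicts(1, 2, II=2)
```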

6 Modulo Scheduling Problem (cont.)
[Figure: a) an example dataflow graph with operations n1-n4 and a 2x2 matrix of FUs (fu1-fu4); b) space-time representation (t = 0..4) showing the prologue, the steady state (kernel), and the epilogue]
II = 1, pipeline stages = 3, 4 operations/cycle in the kernel
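To make the figure concrete, the sketch below assumes one plausible placement consistent with the slide's numbers (II = 1, 3 pipeline stages): n1 at t = 0, n2 and n3 at t = 1, n4 at t = 2, each on its own FU. Overlapping three iterations then produces the prologue, the steady-state kernel and the epilogue; the exact placement in the figure may differ.

```python
# Hypothetical placement consistent with II = 1 and 3 pipeline stages;
# the actual placement in the slide's figure may differ.
schedule = {"n1": ("fu1", 0), "n2": ("fu2", 1), "n3": ("fu3", 1), "n4": ("fu4", 2)}
II, iterations = 1, 3

timeline = {}  # cycle -> list of (operation, iteration, FU)
for i in range(iterations):
    for op, (fu, t) in schedule.items():
        timeline.setdefault(t + i * II, []).append((op, i, fu))

for cycle in sorted(timeline):
    print(cycle, timeline[cycle])
# Cycle 2 holds one operation from every pipeline stage (4 operations on 4 FUs):
# that is the steady-state kernel; earlier cycles form the prologue, later ones the epilogue.
```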

7 Outline
Introduction
Compiler Framework
 structure of the compiler
 architecture description and abstraction
Modulo Scheduling Algorithm
Conclusions and Future Work

8 The Structure of the DRESC Compiler
DRESC stands for Dynamically Reconfigurable Embedded Systems Compiler.
[Figure: tool flow. A C program goes through the IMPACT frontend (an external tool) into Lcode IR, then through dataflow analysis & transformation into the modulo scheduling algorithm. An architecture description is turned by the architecture parser into an architecture abstraction that feeds the scheduler. A simulator is under development.]

9 The Target Architecture Template
Generalizes common features of other coarse-grained architectures
Uses an XML-based language to specify topology, resource allocation, operations and timing
[Figure: example of an FU with a register file (muxa/muxb/muxc feeding the pred, src1 and src2 inputs; dst1, pred_dst1 and pred_dst2 outputs; a configuration RAM and an output register) and examples of topologies]
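Purely as an illustration of what the parsed template might contain (the actual XML schema is not shown in the transcript), the architecture parser could produce records along these lines; every field name here is an assumption:

```python
# Hypothetical in-memory form of a parsed architecture description.
# Field names are assumptions; the real XML schema is not given in the slides.
from dataclasses import dataclass, field

@dataclass
class FunctionalUnit:
    name: str
    ops: list                      # operations the FU supports
    latency: int = 1               # timing information
    inputs: tuple = ("pred", "src1", "src2")
    outputs: tuple = ("pred_dst1", "pred_dst2", "dst1")

@dataclass
class RegisterFile:
    name: str
    size: int                      # number of registers (capacity)
    read_ports: int = 2
    write_ports: int = 1

@dataclass
class Architecture:
    fus: list = field(default_factory=list)   # functional units
    rfs: list = field(default_factory=list)   # register files
    edges: list = field(default_factory=list) # topology as (from, to) port pairs
```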

10 Architecture Description and Abstraction
Flow: XML-based architecture description -> architecture parser -> architecture abstraction -> MRRG representation
The description specifies resource allocation, operation binding, topology and timing.
The Modulo Routing Resource Graph (MRRG) abstracts the architecture for modulo scheduling. It combines features of:
 the modulo reservation table (MRT) from VLIW compilation
 the routing resource graph from FPGA placement & routing

11 Definition of the MRRG
The MRRG is defined as a 3-tuple G = {V, E, II}:
 v = (r, t), where r refers to a resource and t to a time stamp
 E = {(v_m, v_n) | t(v_m) <= t(v_n)}
 II = initiation interval
Important properties:
 modulo: if node (r, t_j) is used, all nodes {(r, t_k) | t_j mod II = t_k mod II} are used too
 asymmetric: there is no route from v_i to v_j if t(v_i) > t(v_j)
The modulo scheduling problem is thus transformed into a placement and routing (P&R) problem on the MRRG.
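A small sketch of how the definition above might be represented in code, including the modulo and asymmetry properties (one node per (resource, time) pair). The class is illustrative and not the actual DRESC data structure:

```python
# Illustrative MRRG representation; not the actual DRESC data structure.
from collections import defaultdict

class MRRG:
    def __init__(self, resources, time_extent, II):
        self.II = II
        # one node per (resource, time) pair
        self.nodes = {(r, t) for r in resources for t in range(time_extent)}
        self.edges = defaultdict(set)

    def add_edge(self, src, dst):
        # asymmetry: routing never goes backwards in time
        assert dst[1] >= src[1]
        self.edges[src].add(dst)

    def modulo_class(self, node):
        """All nodes that are occupied together with `node` under the modulo property."""
        r, t = node
        return {(r2, t2) for (r2, t2) in self.nodes
                if r2 == r and t2 % self.II == t % self.II}
```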

12 Transforming Components to the MRRG
Register allocation becomes part of the P&R problem and is implicitly solved by the modulo scheduling algorithm.
Register file modeling is based on Roos2001.
[Figure: an FU is modeled as a source node and a sink node with pred/src1/src2 inputs and dst/pred_dst1/pred_dst2 outputs; a register file is modeled across cycles with in/out1/out2 ports and a capacity (cap) node]
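A rough, hypothetical sketch of the idea behind this modeling (the exact Roos2001 construction is not reproduced on the slide): storage becomes a capacity-limited node replicated per cycle, so keeping a value alive across cycles consumes register capacity just like any other routing resource.

```python
# Hypothetical expansion of a register file into MRRG-style nodes.
# It only illustrates the idea of a capacity-limited storage node per cycle;
# the precise construction follows Roos2001 and is not shown in the slides.
def expand_register_file(name, num_regs, II):
    nodes, edges, capacity = [], [], {}
    for t in range(II):
        in_node, cap_node, out_node = (name, "in", t), (name, "cap", t), (name, "out", t)
        nodes += [in_node, cap_node, out_node]
        capacity[cap_node] = num_regs                # at most num_regs live values per cycle
        edges.append((in_node, cap_node))            # a written value occupies a register
        edges.append((cap_node, out_node))           # a stored value can be read out
        edges.append((cap_node, (name, "cap", (t + 1) % II)))  # value stays alive into the next cycle
    return nodes, edges, capacity
```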

13 Outline
Introduction
Compiler Framework
Modulo Scheduling Algorithm
 combined placement and routing
 congestion negotiation
 simulated annealing
 results and related work
Conclusions and Future Work

14 Combined Placement and Routing
In a space-time routing resource graph, routability cannot be guaranteed during placement, so placement and routing have to be solved together.
[Figure: for normal FPGA P&R, LUTs are placed and then connected through switch blocks; the basic loop is: initial placement & routing -> rip up an operation -> re-placement -> routing -> success?]

15 Proposed Algorithm
[Figure: algorithm flow — initial P&R and penalty, rip up an operation, re-place and re-route it, evaluate the new P&R, accept or restore, update penalty and temperature]
1. Sort the operations.
2. For each II, first generate an initial schedule that respects dependence constraints only.
3. Iteratively reduce resource overuse and try to reach a legal schedule:
 At every iteration, an operation is ripped up from the existing schedule and placed randomly.
 Connected nets are rerouted accordingly.
 A cost function (next slide) is computed to evaluate the new placement and routing.
 A simulated annealing strategy decides whether the new placement and routing is accepted.
A toy sketch of this loop is given below.
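Below is a toy, self-contained sketch of this loop (congestion negotiation plus simulated annealing). It places each operation on an (FU, cycle) slot and only models the modulo resource constraint and dependence order; routing is omitted, and the assumed diamond-shaped dataflow graph (n1 -> n2, n3 -> n4) mirrors the earlier example but is not taken verbatim from the slides.

```python
# Toy sketch of the rip-up / re-place / anneal loop; names and simplifications are mine, not DRESC's.
import math, random

def cost(place, deps, II, penalty):
    use = {}
    for fu, t in place.values():                       # modulo resource usage
        use[(fu, t % II)] = use.get((fu, t % II), 0) + 1
    overuse = sum(n - 1 for n in use.values() if n > 1)
    order = sum(1 for a, b in deps if place[a][1] >= place[b][1])  # consumer must come later
    return penalty * overuse + 10 * order

def schedule(ops, deps, fus, II, length=4, iters=20000):
    place = {op: (random.choice(fus), random.randrange(length)) for op in ops}
    temp, penalty = 10.0, 1.0
    for _ in range(iters):
        if cost(place, deps, II, penalty) == 0:
            return place                               # legal modulo schedule found
        op = random.choice(ops)                        # rip up one operation...
        old = place[op]
        place[op] = (random.choice(fus), random.randrange(length))  # ...and re-place it randomly
        delta = cost(place, deps, II, penalty) - cost({**place, op: old}, deps, II, penalty)
        if delta > 0 and random.random() >= math.exp(-delta / temp):
            place[op] = old                            # reject the move (simulated annealing)
        temp *= 0.999
        penalty *= 1.001                               # congestion negotiation: overuse gets pricier
    return None                                        # no legal schedule found at this II

print(schedule(["n1", "n2", "n3", "n4"],
               [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")],
               fus=["fu1", "fu2", "fu3", "fu4"], II=1))
```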

16 Cost Function
Resources are allowed to be overused during P&R.
The cost of using one node is computed from the following quantities (the slide's formula is not reproduced in the transcript; a hedged reconstruction follows below):
 base: base cost of the node in the MRRG
 occ: occupancy of the node
 cap: capacity of the node
 p: penalty factor for overuse
The penalty is increased over time as the iterations proceed.
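The transcript does not contain the slide's formulas (they were part of the figure). A plausible Pathfinder-style reconstruction using the quantities listed above is sketched here; this is an assumption, not necessarily the paper's exact cost function:

```latex
% Hedged reconstruction; not necessarily the exact formula from the slide.
\[
  \mathrm{cost}(v) = \mathrm{base}(v)\cdot\bigl(1 + p\cdot\max(0,\;\mathrm{occ}(v)-\mathrm{cap}(v))\bigr),
  \qquad
  p \leftarrow \alpha \cdot p \;\;(\alpha > 1) \text{ after each iteration}.
\]
```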

17 Parameters to Tune the Algorithm
Ordering of operations: techniques from Llosa2001
Relaxing factor of the schedule length: difficulty of moving operations vs. more pipeline stages
Parameters of the SA algorithm
Costs associated with different resources: register files get a lower base cost
Penalty factor associated with overused resources: a compromise between scheduling quality and speed
...

18 Scheduling Results
Scheduling results were obtained on an 8x8 matrix that resembles the topology of MorphoSys.
Algorithm limitations:
 Scheduling speed is relatively slow
 Scheduling quality still has room for improvement
 Cannot handle pipelined FUs
 Can only handle the inner loop of a loop nest

19 Related Work
 Modulo scheduling on clustered VLIWs: the problem is simpler in nature (no routing)
 RaPiD, Garp: row-based architectures and scheduling techniques; no multiplexing
 PipeRench: the ring-like architecture is very specific; the scheduling techniques are not general
 Z. Huang, S. Malik, DAC 2002: either use a full crossbar or generate a dedicated datapath for several loops to be pipelined

20 Outline
Introduction
Compiler Framework
Exploiting Parallelism
Conclusions and Future Work

21 Conclusions and Future Work
Conclusions:
 Coarse-grained architectures have distinct features; compilers for them are possible and needed
 Loop-level parallelism is the right kind of parallelism for coarse-grained reconfigurable architectures
 A novel modulo scheduling algorithm and an abstract architecture representation have been developed
Future work:
 Improve the quality and speed of the scheduling algorithm
 Enlarge the scope of pipelineable loops
 Develop techniques to reduce the bottlenecks of pipelineable loops, e.g., taking distributed memory into account

