1
Generation of CDFGs from Scheduled and Pipelined Assembly Code
The 18th International Workshop on Languages and Compilers for Parallel Computing
October 20, 2005
David Zaretsky, Gaurav Mittal, Robert Dick, and Prith Banerjee
Department of Electrical Engineering and Computer Science, Northwestern University
College of Engineering, University of Illinois at Chicago
2
The Future of DSP Applications
Recent advances in embedded communications and control systems are pushing the computational limits of DSP applications, driving the need for hardware/software co-design systems.
[Figure: DSP performance requirements for new communication technologies compared with the standard DSP performance roadmap, 2000 to 2004; y-axis: DSP operations per second (billion MAC/s). Labeled applications: Voice over IP; HDTV, MPEG4; Video over IP; 3G Wireless / WCDMA; 4G Wireless; Future Broadband.]
3
Binary Translation
Problems with high-level synthesis:
- High-level application unavailable
- Hardware compiler unavailable
Binary translation issues:
- Grammar
- Operation latencies
- Software pipelining
- Processor architecture limitations: functional units, data paths, physical registers, memory spilling
Control and data flow graphs:
- Optimizations
- Scheduling
- Design decisions
4
FREEDOM: Bridging the Gap
The FREEDOM compiler automates hw/sw partitioning of software binaries.
FREEDOM is an acronym for Fabrication of Reconfigurable Hardware Environments from DSP Optimized Machine Code.
- FPGA designers are unfamiliar with DSP concepts.
- DSP designers are not versed in FPGA design.
[Figure: design flow from the DSP design environment (assembly/binary) to the ASIC/FPGA design environment (VHDL/Verilog, RTL simulation, logic synthesis, place & route); labels: manually created RTL models, verified RTL models, netlist of primitives.]
5
Related Work
- Binary Decompilation & Translation: Cifuentes93/96/98, Kruegel04, Dehnert03, Stitt02/03
- Dynamic Binary Optimizations: Bala00, Gschwind00, Ye00, Levine03
- Control and Data Flow Analysis: Kastner02, Decker03, Amme00, Cooper02
6
Presentation Overview
- FREEDOM Compiler Infrastructure
- Data Dependency Analysis
- CDFG Generation from Scheduled Assembly Code
- Experimental Results
- Summary & Conclusions
7
The FREEDOM Compiler
- Common entry point for multiple assembly languages.
- Intermediate levels: Machine Language Syntax Tree, Control & Data Flow Graph, Hardware Description Language.
- An Architecture Description Language provides resource information for the target FPGA architecture.
- Output: RTL VHDL/Verilog and a testbench.
8
Machine Language Abstract Syntax Tree (MST)
A generic language that encapsulates most ISAs, including predicated and parallel instruction sets.
All MST instructions are three-operand, predicated instructions: [pred] op src1 src2 dst
Operand types: memory address, label, register, immediate.
Operator types:
- Logical: AND, NAND, NEG, NOR, NOT, OR, XOR, SLL, SRL, etc.
- Arithmetic: ADD, DIV, MULT, SUB
- Branch: BEQ, BGEQ, BGT, BLEQ, BLT, BNEQ, GOTO, CALL
- Comparison: CMPEQ, CMPNE, CMPLT, CMPLE, CMPGT, CMPGE
- Assignment: LD, ST, MOVE, UNION
- General: NOP
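As a rough illustration of the instruction format, the sketch below models an MST instruction in Python. The class and helper names (`Operand`, `MSTInstruction`, `reg`, `imm`, `render`) and the `$`-prefixed rendering are assumptions modeled on the timestep table in this presentation, not FREEDOM's actual data structures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union

class OperandKind(Enum):
    MEMORY = "memory"        # memory address; here an indirect register such as *($A4)
    LABEL = "label"
    REGISTER = "register"
    IMMEDIATE = "immediate"

@dataclass(frozen=True)
class Operand:
    kind: OperandKind
    value: Union[str, int]

def reg(name: str) -> Operand:
    return Operand(OperandKind.REGISTER, name)

def imm(value: int) -> Operand:
    return Operand(OperandKind.IMMEDIATE, value)

def render(o: Operand) -> str:
    if o.kind is OperandKind.REGISTER:
        return f"${o.value}"
    if o.kind is OperandKind.MEMORY:
        return f"*(${o.value})"
    return str(o.value)  # labels and immediates print as-is

@dataclass(frozen=True)
class MSTInstruction:
    """[pred] op src1 src2 dst; unused operand slots are None (hypothetical model)."""
    op: str
    src1: Optional[Operand] = None
    src2: Optional[Operand] = None
    dst: Optional[Operand] = None
    pred: Optional[Operand] = None  # None means the instruction always executes

    def __str__(self) -> str:
        operands = ", ".join(render(o) for o in (self.src1, self.src2, self.dst) if o is not None)
        prefix = f"[{render(self.pred)}] " if self.pred is not None else ""
        return f"{prefix}{self.op} {operands}".rstrip()

print(MSTInstruction("MULT", reg("A4"), imm(2), reg("A4")))
print(MSTInstruction("SUB", reg("A1"), imm(1), reg("A1"), pred=reg("A1")))
```

Keeping the predicate as an optional field means unpredicated instructions and predicated ones such as `[A1] SUB A1, 1, A1` share one shape, so later passes can treat every instruction uniformly.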
9
Data Dependency Analysis
MST instructions are assigned a timestep T and an operation delay.
The n-th instruction in a parallel set is offset by $T_n = T + 0.01\,n$.
The m-th instruction in an expanded set is offset by $T_m = T_n + 0.0001\,m$.
The write-back stage of an instruction is defined as $wb = \text{timestep} + \text{delay}$.

TIMESTEP  PC      OP    DELAY  SRC1     SRC2  DST
1.0000    0x0020  MULT  2      $A4      2     $A4
2.0000    0x0024  LD    5      *($A4)         $A2
2.0100    0x0028  ADD   1      $A4      4     $A2
3.0000    0x002C  ADD   1      $A4      $A2   $A3
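The timestep and write-back rules can be checked mechanically. This hedged sketch uses Python's `Decimal` so the 0.01 and 0.0001 offsets stay exact; the `timestep` and `write_back` helpers and the `Row` tuple are hypothetical names. For the four table rows it yields write-back times of 3.0 (MULT), 7.0 (LD), 3.01 (first ADD) and 4.0 (second ADD).

```python
from decimal import Decimal
from typing import NamedTuple

PARALLEL_STEP = Decimal("0.01")    # offset per instruction within a parallel (||) set
EXPANDED_STEP = Decimal("0.0001")  # offset per instruction within an expanded set

def timestep(cycle: int, n: int = 0, m: int = 0) -> Decimal:
    """T_n = T + 0.01*n, then T_m = T_n + 0.0001*m."""
    t_n = Decimal(cycle) + PARALLEL_STEP * n
    return t_n + EXPANDED_STEP * m

def write_back(ts: Decimal, delay: int) -> Decimal:
    """wb = timestep + delay"""
    return ts + delay

class Row(NamedTuple):
    ts: Decimal
    pc: int
    op: str
    delay: int

# The rows of the table above; the first ADD is the second member (n = 1) of cycle 2's set.
rows = [
    Row(timestep(1), 0x0020, "MULT", 2),
    Row(timestep(2), 0x0024, "LD", 5),
    Row(timestep(2, n=1), 0x0028, "ADD", 1),
    Row(timestep(3), 0x002C, "ADD", 1),
]

for r in rows:
    print(f"{r.ts:.4f} {r.pc:#06x} {r.op:<4} wb={write_back(r.ts, r.delay):.4f}")
```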
10
CDFG Generation from Scheduled Assembly Code
Pipelined assembly code presents difficulties for CDFG generation:
- complex control flows
- varying data dependencies
CDFG generation proceeds in three steps:
1. Generate a control flow graph.
2. Linearize pipelined operations.
3. Generate a data flow graph.

0x0000 VECTORSUM: ZERO A7
0x0004            LDW *A4++, A6
0x0008         || B LOOP
0x000C            LDW *A4++, A6
0x0010         || B LOOP
0x0014            LDW *A4++, A6
0x0018         || B LOOP
0x001C            LDW *A4++, A6
0x0020         || B LOOP
0x0024            LDW *A4++, A6
0x0028         || B LOOP
0x002C         || SUB A1, 4, A1
0x0030 LOOP:      ADD A6, A7, A7
0x0034         || [A1] LDW *A4++, A6
0x0038         || [A1] SUB A1, 1, A1
0x003C         || [A1] B LOOP
0x0040            STW A7, *A5
0x0044            NOP 4
11
Building a Control Flow Graph
Based on work by K. Cooper et al., “Building a Control-Flow Graph from Scheduled Assembly Code,” Dept. of Computer Science, Rice University.
Generates a CFG in O(n) time, in three stages:
1. Partition the code at labels into a set of basic blocks.
2. Add edges between CFG blocks to represent the normal flow of control.
3. Iteratively propagate pipelined branch and counter information in a simulated control flow.
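Stages 1 and 2 are simple enough to sketch. The Python below is a minimal illustration, not the Cooper et al. implementation: it assumes each input line is a `(label, instruction)` pair, the names `Block`, `partition_at_labels` and `add_fallthrough_edges` are hypothetical, and stage 3 (the simulated propagation of pipelined branches, where the real complexity lies) is left out. The input is an abbreviated version of the VECTORSUM listing with the prologue shortened.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Block:
    label: Optional[str]
    instrs: List[str] = field(default_factory=list)
    succs: List[int] = field(default_factory=list)  # indices of successor blocks

def partition_at_labels(code: List[Tuple[Optional[str], str]]) -> List[Block]:
    """Stage 1: open a new basic block at every label (single pass)."""
    blocks: List[Block] = []
    for label, instr in code:
        if label is not None or not blocks:
            blocks.append(Block(label))
        blocks[-1].instrs.append(instr)
    return blocks

def add_fallthrough_edges(blocks: List[Block]) -> None:
    """Stage 2: normal flow of control falls from each block into the next."""
    for i in range(len(blocks) - 1):
        blocks[i].succs.append(i + 1)

code = [
    ("VECTORSUM", "ZERO A7"),
    (None, "LDW *A4++, A6"),
    (None, "|| B LOOP"),
    ("LOOP", "ADD A6, A7, A7"),
    (None, "|| [A1] LDW *A4++, A6"),
    (None, "|| [A1] SUB A1, 1, A1"),
    (None, "|| [A1] B LOOP"),
    (None, "STW A7, *A5"),
    (None, "NOP 4"),
]
blocks = partition_at_labels(code)
add_fallthrough_edges(blocks)
for b in blocks:
    print(b.label, len(b.instrs), b.succs)
```

Both passes visit each instruction or block once, in line with the O(n) bound. Because this sketch splits only at labels, the instructions after the loop body stay in the LOOP block.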
12
Event-Triggered Operations
- Analogous to a read/write pipeline architecture: the event trigger and execution stages are offset by the operation delay (d).
- Implemented using a virtual shift register of size d.
- An event is triggered by assigning a ‘1’ to the highest bit (d-1).
- An SRL operation is performed on the register in each successive cycle.
- The event executes after d cycles, when the ‘1’ appears in bit zero.
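A small simulation makes the shift-register timing concrete. This sketch assumes a particular per-cycle order (test bit 0, shift right, then set bit d-1 for a new trigger), which makes an event fire exactly d cycles after its trigger; `EventRegister` and `simulate` are hypothetical names. Because each in-flight event is a separate bit, overlapping triggers, as in a software-pipelined loop, are tracked without extra state.

```python
from typing import Iterable, List, Set

class EventRegister:
    """Virtual shift register of width d modeling an operation with delay d."""

    def __init__(self, d: int) -> None:
        if d < 1:
            raise ValueError("delay must be at least 1")
        self.d = d
        self.bits = 0

    def step(self, trigger: bool) -> bool:
        """Advance one cycle; return True if an event executes in this cycle."""
        fires = bool(self.bits & 1)          # a '1' in bit 0 means execute now
        self.bits >>= 1                      # SRL by one each cycle
        if trigger:
            self.bits |= 1 << (self.d - 1)   # trigger: set the highest bit (d-1)
        return fires

def simulate(d: int, trigger_cycles: Iterable[int], n_cycles: int) -> List[int]:
    """Return the cycles in which triggered events execute."""
    reg = EventRegister(d)
    triggers: Set[int] = set(trigger_cycles)
    return [c for c in range(n_cycles) if reg.step(c in triggers)]

print(simulate(3, [0, 1], 10))
```

Here `simulate(3, [0, 1], 10)` returns `[3, 4]`: two events triggered on consecutive cycles with d = 3 execute three cycles later, in order.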
13
Linearizing Pipelined Branch Operations
- Iteratively propagate pipelined branch and counter information in a simulated control flow.
- Trigger a change in control flow after a number of delay cycles; only the event is propagated, using the SRL operation.
- A copy of the branch instruction is inserted at each execution point, predicated on the event shift register.
- Intersecting branch paths are merged by OR-ing their predicates.
- The original branch instructions are replaced with NOPs.
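The branch rules above can be sketched as a transformation over one simulated path. This is a simplified illustration under several assumptions: instructions are `(predicate, text)` pairs, every pipelined branch has the same `delay`, each branch gets its own hypothetical event register `E<cycle>_<slot>`, and the only intersection handled is two copies reaching the same point with the same target. The original predicate (for example `[A1]`) moves onto the trigger, so the copied branch needs only the event bit.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical instruction form: (predicate or None, MST-like text).
Instr = Tuple[Optional[str], str]

def linearize_branches(trace: List[List[Instr]], delay: int) -> List[List[Instr]]:
    """trace[c] is the bundle issued in cycle c along one simulated path."""
    out: List[List[Instr]] = [[] for _ in range(len(trace) + delay)]
    due: Dict[Tuple[int, str], List[str]] = defaultdict(list)  # (exec cycle, branch) -> event bits
    for c, bundle in enumerate(trace):
        for i, (pred, text) in enumerate(bundle):
            if not text.startswith("B "):
                out[c].append((pred, text))
                continue
            ev = f"E{c}_{i}"  # hypothetical event-register name
            # Trigger keeps the original predicate and sets the highest bit (delay - 1).
            out[c].append((pred, f"OR {ev}, {1 << (delay - 1)}, {ev}"))
            out[c].append((None, "NOP"))  # original branch replaced with a NOP
            for k in range(c + 1, c + delay):  # only the event is propagated
                out[k].append((None, f"SRL {ev}, 1, {ev}"))
            due[(c + delay, text)].append(f"{ev}[0]")
    # A copy of each branch at its execution point; coinciding copies are OR-merged.
    for (cycle, text), bits in due.items():
        out[cycle].append((" | ".join(bits), text))
    return out

for cycle, bundle in enumerate(linearize_branches(
        [[(None, "ZERO A7"), (None, "B LOOP")], [(None, "ADD A6, A7, A7")]], 2)):
    print(cycle, bundle)
```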
14
Linearizing Pipelined Computational Operations
- Multi-cycle instructions are serialized into well-defined data flow paths along the pipeline.
- For an instruction with n delay slots, the value is propagated through virtual registers $R_{n-1} \leftarrow R_n,\; R_{n-2} \leftarrow R_{n-1},\; \ldots,\; R_0 \leftarrow R_1$, where $R_0$ is the original register name.
- Each instruction in the sequence is guarded by a predicate on an event-triggering register bit.
- Intersecting data paths are merged by OR-ing predicates.
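The register chain can be generated directly. In this hedged sketch, `serialize` and `vreg` are hypothetical helpers, virtual registers are named `A2_1`, `A2_2`, and so on, and the move that writes R_k is guarded on bit k of the event register. That guard assumes the same shift timing as the event-triggered model (the trigger sets bit n-1 and the register shifts right once per cycle). The code emits MST-style `MOVE src, dst` text; the trigger and SRL operations themselves are omitted.

```python
from typing import List

def vreg(base: str, k: int) -> str:
    """R_0 is the original register; R_1..R_n are hypothetical virtual copies."""
    return base if k == 0 else f"{base}_{k}"

def serialize(op: str, srcs: List[str], dst: str, n: int, event: str) -> List[str]:
    """Element j of the result is the operation placed in cycle j (0..n)."""
    code = [f"{op} {', '.join(srcs)}, {vreg(dst, n)}"]  # cycle 0: compute into R_n
    for k in range(n - 1, -1, -1):                      # cycles 1..n
        # R_k <- R_{k+1}, guarded on bit k of the event register
        code.append(f"[{event}[{k}]] MOVE {vreg(dst, k + 1)}, {vreg(dst, k)}")
    return code

for line in serialize("LD", ["*A4"], "A2", 4, "E_LD"):
    print(line)
```

For a load with four delay slots into A2, this yields the `LD` into `A2_4` followed by four guarded moves ending in `[E_LD[0]] MOVE A2_1, A2`, so the original register receives the value only when the event bit reaches bit zero.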
15
Building the Data Flow Graph
- The DFG represents data dependencies in each MST procedure.
- The DFG is generated using the write-back times of MST instructions.

DOTPROD:  MVK.S1 500,A1
          ZERO.L1 A7
          MVK.S1 2000,A3
LOOP:     LDW.D1 *A4++,A2
          LDW.D1 *A3++,A5
          NOP 4
          MPY.M1 A2,A5,A6
          SUB.S1 A1,1,A1
          ADD.L1 A6,A7,A7
          [A1] B.S2 LOOP
          NOP 5
          STW.D1 A7,*A3
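To illustrate how write-back times determine dependencies, this sketch builds DFG edges for the four instructions in the timestep table from the Data Dependency Analysis slide. The visibility rule is an assumption: a value is taken to be visible to any instruction whose timestep falls in or after the integer cycle of its write-back time, and a reader takes the matching write with the latest write-back time. Memory dependencies and predicates are ignored, and `Instr` and `build_dfg` are hypothetical names.

```python
from decimal import Decimal
from math import floor
from typing import List, NamedTuple, Optional, Set, Tuple

class Instr(NamedTuple):
    ts: Decimal            # timestep T, including parallel/expanded offsets
    op: str
    srcs: Tuple[str, ...]  # registers read (the address register for LD)
    dst: Optional[str]     # register written, if any
    delay: int

    @property
    def wb(self) -> Decimal:
        return self.ts + self.delay  # wb = timestep + delay

def build_dfg(instrs: List[Instr]) -> Set[Tuple[int, int, str]]:
    """Return edges (producer index, consumer index, register)."""
    edges: Set[Tuple[int, int, str]] = set()
    for j, reader in enumerate(instrs):
        for reg in reader.srcs:
            producer: Optional[int] = None
            for i, w in enumerate(instrs):
                # Assumed rule: visible once the reader's cycle reaches the write-back cycle.
                if i == j or w.dst != reg or floor(w.wb) > floor(reader.ts):
                    continue
                if producer is None or w.wb > instrs[producer].wb:
                    producer = i
            if producer is not None:
                edges.add((producer, j, reg))
    return edges

table = [
    Instr(Decimal("1.0000"), "MULT", ("A4",), "A4", 2),
    Instr(Decimal("2.0000"), "LD", ("A4",), "A2", 5),
    Instr(Decimal("2.0100"), "ADD", ("A4",), "A2", 1),
    Instr(Decimal("3.0000"), "ADD", ("A4", "A2"), "A3", 1),
]
print(sorted(build_dfg(table)))
```

Under this rule the LD at 2.0 reads the old value of A4 because the MULT has not yet written back, while the ADD at 3.0 takes A4 from the MULT and A2 from the ADD at 2.01 rather than from the still-pending LD. In plain program order, the LD would instead appear to depend on the MULT.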
16
CDFG Optimizations
Traditional optimizations:
- SSA, common sub-expression elimination, copy propagation, constant propagation, constant folding, strength reduction, dead code elimination, loop unrolling, register allocation
Custom optimizations:
- Identify I/O ports, undefined variable elimination, constant predicate elimination, memory forwarding, Boolean reduction, shift reduction, block-set merging, empty block extraction
17
Experimental Results
- Each benchmark was verified to be bit-true accurate using ModelSim.
- ~9 instructions were added for each pipelined operation.
- Code size increased by ~27% during the linearization process.
- Values reflect the size of the design before CDFG optimizations.
18
Summary & Conclusions
- HLS compilers generally convert designs into CDFGs, which are used for optimizations, scheduling, and design decisions.
- Generating CDFGs from pipelined and scheduled assembly code is complex.
- The FREEDOM compiler generates CDFGs in three stages:
  1. Generate the control flow graph.
  2. Linearize the assembly code.
  3. Generate the data flow graph.
- Verification on highly pipelined benchmarks shows improved performance.