Lecture 12: Low-Power High-Level Synthesis (3). Prof. Jun-Dong Cho, Sungkyunkwan University
Matrix-vector product algorithm
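The slide presents the matrix-vector product algorithm; a minimal behavioural sketch of the computation (Python, names illustrative, not from the slide) is:

```python
def mat_vec(A, x):
    """Direct matrix-vector product y = A*x: one inner product per output."""
    y = [0] * len(A)
    for i, row in enumerate(A):        # each row yields one output sample
        acc = 0
        for a_ij, x_j in zip(row, x):  # multiply-accumulate chain
            acc += a_ij * x_j
        y[i] = acc
    return y

print(mat_vec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```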
Retiming: flip-flop insertion to minimize hazard activity by moving a flip-flop in the circuit.
Exploiting spatial locality for interconnect power reduction [figure: global vs. local interconnect feeding Adder1 and Adder2]
Balancing maximal time-sharing against a fully-parallel implementation [figure: a fourth-order parallel-form IIR filter; (a) local assignment (2 global transfers), (b) non-local assignment (20 global transfers)]
Retiming/pipelining for Critical path
Effective Resource Utilization
Hazard propagation elimination by clocked sampling: by sampling the steady-state signal at a register input, glitches are no longer propagated into the following combinational logic.
Regularity: common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers, and buses). Regular designs often have less control hardware.
Module Selection
Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that minimal hardware cost is obtained under the given timing and throughput constraints.
– Full pipelining is ineffective when the clock period mismatches the execution times of the operators; performing operations in sequence without intermediate buffering can shorten the critical path.
– When clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph should be maximized.
– During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates the timing constraints.
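A minimal sketch of this selection step, assuming a made-up module library (the module names, areas, and delays below are illustrative, not from the slides): for each operation type, pick the cheapest module that fits the clock period, otherwise the fastest one, which then becomes a pipelining candidate.

```python
# Hypothetical module library: operation type -> list of (name, area, delay_ns).
MODULE_LIBRARY = {
    "mul": [("booth_mul", 900, 18.0), ("wallace_mul", 1400, 10.0)],
    "add": [("ripple_add", 120, 6.0), ("cla_add", 250, 3.5)],
}

def select_modules(clock_period_ns):
    """Per operation type, pick the smallest module whose delay fits the clock
    period; if none fits, take the fastest one and mark it for pipelining."""
    choice = {}
    for op, candidates in MODULE_LIBRARY.items():
        feasible = [c for c in candidates if c[2] <= clock_period_ns]
        if feasible:
            name, _, _ = min(feasible, key=lambda c: c[1])    # cheapest area
            choice[op] = (name, "single-cycle")
        else:
            name, _, _ = min(candidates, key=lambda c: c[2])  # fastest delay
            choice[op] = (name, "pipeline")
    return choice

print(select_modules(12.0))
# {'mul': ('wallace_mul', 'single-cycle'), 'add': ('ripple_add', 'single-cycle')}
```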
Estimation
Estimate min and max bounds on the required resources to:
– delimit the design space (the min bounds serve as an initial solution);
– serve as entries in a resource utilization table which guides the transformation, assignment and scheduling operations.
The max bound on the execution time, t_max, follows from a topological ordering of the DFG using ASAP and ALAP.
Minimum bound on the number of resources for each resource class: N_Ri = ⌈(O_Ri · d_Ri) / t_max⌉, where N_Ri is the number of resources of class Ri, d_Ri is the duration of a single operation, and O_Ri is the number of operations of that class.
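As a worked instance of the bound above (illustrative numbers): with O_mul = 6 multiplications of d_mul = 2 cycles each and a schedule of at most t_max = 8 cycles, at least ⌈6·2/8⌉ = 2 multipliers are needed.

```python
import math

def min_resource_bound(num_ops, op_duration, t_max):
    """Lower bound on units of one resource class: total busy time of that
    class divided by the schedule length, rounded up."""
    return math.ceil(num_ops * op_duration / t_max)

print(min_resource_bound(num_ops=6, op_duration=2, t_max=8))  # -> 2 multipliers
```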
Exploring the Design Space
Find the minimal-area solution subject to the timing constraints. By checking the critical paths, determine whether the proposed graph violates the timing constraints; if so, retiming, pipelining and tree-height reduction can be applied. After an acceptable graph is obtained, the resource allocation process is initiated:
– change the available hardware (FUs, registers, busses);
– redistribute the time allocation over the sub-graphs;
– transform the graph to reduce the hardware requirements.
Use a rejectionless probabilistic iterative search technique (a variant of Simulated Annealing), where moves are always accepted; this reduces the computational complexity and gives faster convergence.
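A minimal sketch of the rejectionless selection idea (the `moves` and `cost` interfaces are placeholders, not from the slides): every candidate move is weighted by its Metropolis acceptance probability and one move is drawn from that distribution, so no proposal is ever discarded.

```python
import math
import random

def rejectionless_step(state, moves, cost, temperature):
    """One rejectionless search step: weight each candidate move by its
    Metropolis acceptance probability exp(-delta/T) (1 for improving moves)
    and always apply one move sampled from that distribution."""
    current = cost(state)
    weights = []
    for m in moves:                    # each move maps a state to a new state
        delta = cost(m(state)) - current
        w = 1.0 if delta <= 0 else math.exp(-delta / temperature)
        weights.append(max(w, 1e-12))  # keep the distribution well-defined
    chosen = random.choices(moves, weights=weights, k=1)[0]
    return chosen(state)
```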
Data path Synthesis
Scheduling and Binding
The scheduling task selects the control step in which a given operation will happen, i.e., assigns each operation to an execution cycle.
Sharing: bind a resource to more than one operation.
– Operations must not execute concurrently.
The graph is scheduled hierarchically in a bottom-up fashion.
Power tradeoffs:
– shorter schedules enable supply voltage (Vdd) scaling;
– the schedule directly impacts resource sharing;
– energy consumption depends on what the previous instruction was;
– reordering can minimize the switching on the control path.
Clock selection:
– eliminate slacks;
– choose the optimal system clock period.
ASAP Scheduling Algorithm: HAL Example
ALAP Scheduling Algorithm: HAL Example
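A compact sketch of both schedulers on a generic DFG (the successor map `succ`, unit operation delays, and the tiny example graph are assumptions, not the HAL example itself): ASAP places each operation in the earliest step after its predecessors; ALAP places it in the latest step that still meets a given deadline.

```python
def asap(succ):
    """Earliest control step for each operation, assuming unit-delay ops."""
    preds = {v: [] for v in succ}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)
    step = {}
    def visit(v):
        if v not in step:
            step[v] = 1 + max((visit(p) for p in preds[v]), default=0)
        return step[v]
    for v in succ:
        visit(v)
    return step

def alap(succ, deadline):
    """Latest control step that still meets the deadline (unit-delay ops)."""
    step = {}
    def visit(v):
        if v not in step:
            step[v] = min((visit(s) for s in succ[v]), default=deadline + 1) - 1
        return step[v]
    for v in succ:
        visit(v)
    return step

g = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}   # a, b feed c; c feeds d
print(asap(g))              # {'a': 1, 'b': 1, 'c': 2, 'd': 3}
print(alap(g, deadline=4))  # {'d': 4, 'c': 3, 'a': 2, 'b': 2}
```
The per-operation mobility used later (e.g., by list scheduling) is simply the ALAP step minus the ASAP step.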
Force-Directed Scheduling
Used as a priority function; force is related to concurrency, and operations are sorted for least force. Mechanical analogy: force = constant × displacement, where the constant is the operation-type distribution and the displacement is the change in scheduling probability.
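A sketch of the two quantities behind that definition, assuming each operation is uniformly likely over its [ASAP, ALAP] range (function and variable names are illustrative): the distribution graph DG(j) plays the role of the constant and the change in the operation's probabilities plays the role of the displacement.

```python
def distribution_graph(ops, asap_step, alap_step, num_steps):
    """DG(j): expected number of same-type operations in each control step,
    with each op equally likely anywhere in its [ASAP, ALAP] interval."""
    dg = [0.0] * (num_steps + 1)   # index 0 unused; steps 1..num_steps
    for op in ops:
        lo, hi = asap_step[op], alap_step[op]
        p = 1.0 / (hi - lo + 1)
        for j in range(lo, hi + 1):
            dg[j] += p
    return dg

def self_force(op, step, asap_step, alap_step, dg):
    """Force of fixing `op` into `step`: sum of DG(j) times the change in the
    op's probability (it becomes 1 at `step` and 0 elsewhere in its range)."""
    lo, hi = asap_step[op], alap_step[op]
    p = 1.0 / (hi - lo + 1)
    return sum(dg[j] * ((1.0 if j == step else 0.0) - p) for j in range(lo, hi + 1))
```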
Force-Directed Scheduling
Example: Operation v6
Force-Directed Scheduling Algorithm (Paulin)
Force-Directed Scheduling Example [figures: probability of scheduling operations into control steps; the same probabilities after operation o3 is scheduled to step s2; operator cost for the multiplications in (a) and in (c)]
List Scheduling [figures: DFG with mobility labeling (inside <>), ready-operation list under the resource constraint, and the scheduled DFG]
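A minimal list-scheduling sketch under a resource constraint (unit-delay operations, mobility as the priority; all interfaces are illustrative assumptions):

```python
def list_schedule(succ, op_type, resources, priority):
    """Resource-constrained list scheduling with unit-delay operations.

    succ:      op -> data successors       op_type:  op -> FU type ('mul', 'alu', ...)
    resources: FU type -> units per step   priority: op -> urgency (lower = schedule first)
    """
    preds_left = {v: 0 for v in succ}
    for u, vs in succ.items():
        for v in vs:
            preds_left[v] += 1
    ready = [v for v, n in preds_left.items() if n == 0]
    schedule, step = {}, 0
    while ready:
        step += 1
        used = {t: 0 for t in resources}
        for op in sorted(ready, key=priority):        # most urgent first
            if used[op_type[op]] < resources[op_type[op]]:
                used[op_type[op]] += 1
                schedule[op] = step
        for op in [o for o in ready if schedule.get(o) == step]:
            ready.remove(op)                          # done; release successors
            for s in succ[op]:
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)
    return schedule
```
With mobility (ALAP minus ASAP) as the priority function, this corresponds to the mobility-labelled view of the DFG described on the slide.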
Static-List Scheduling [figures: DFG, priority list, partial schedule of five nodes, the final schedule]
Divide-and-Conquer to Minimize Power Consumption
Decompose the computation into strongly connected components (SCCs); merge any adjacent trivial SCCs into a sub-part.
Use pipelining to isolate the sub-parts.
For each sub-part:
– minimize the number of delays using retiming;
– if the sub-part is linear, apply optimal unfolding;
– else, apply unfolding after isolating the nonlinear operations.
Merge linear sub-parts to optimize further; schedule the merged sub-parts to minimize memory usage.
Choosing Optimal Clock Period
SCC decomposition step: uses the standard depth-first-search-based algorithm [Tarjan, 1972], which has low-order polynomial-time complexity. For any pair of operations A and B within an SCC, there exist both a path from A to B and a path from B to A. The graph formed by all the SCCs is acyclic; thus, the SCCs can be isolated from each other using pipeline delays, which enables each SCC to be optimized separately.
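A sketch of this decomposition step on a successor-map representation of the DFG (the example graph and names are illustrative):

```python
def tarjan_scc(succ):
    """Tarjan's strongly connected components [Tarjan, 1972].

    `succ` maps each node to its successors; returns the SCCs as lists of
    nodes, in reverse topological order of the (acyclic) condensed graph."""
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in succ:
        if v not in index:
            strongconnect(v)
    return sccs

# a <-> b form one (feedback) SCC; c is a trivial SCC that can be pipelined off.
print(tarjan_scc({"a": ["b"], "b": ["a", "c"], "c": []}))  # [['c'], ['b', 'a']]
```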
Identifying SCCs: the first step of the approach is to identify the computation's strongly connected components.
Choosing Optimal Clock Period
Supply Voltage Scaling: lowering Vdd reduces energy but increases delay.
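The standard first-order relations behind this tradeoff (the numeric values in the sketch are illustrative): switching energy scales as C·Vdd^2, while gate delay grows roughly as Vdd/(Vdd - Vt)^2.

```python
def switching_energy(c_load, vdd):
    """Energy per transition: E = C * Vdd^2 (first-order CMOS model)."""
    return c_load * vdd ** 2

def gate_delay(vdd, vt, k=1.0):
    """Relative gate delay: t ~ Vdd / (Vdd - Vt)^2 (long-channel approximation)."""
    return k * vdd / (vdd - vt) ** 2

# Scaling Vdd from 5.0 V to 3.3 V with Vt = 0.8 V (illustrative values):
for vdd in (5.0, 3.3):
    print(vdd, switching_energy(1.0, vdd), round(gate_delay(vdd, 0.8), 3))
# Energy drops to ~0.44x, delay grows by ~1.9x, so the schedule must absorb the slowdown.
```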
Multiple Supply Voltages: Filter Example
Scheduling using shut-down: |a-b|
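The |a-b| computation is a common illustration of shut-down (guarded evaluation): the comparison is scheduled in an earlier control step so that its result disables the subtracter whose output is not needed. A behavioural sketch of that guarding (the hardware clock gating or operand isolation itself cannot be expressed in Python):

```python
def abs_diff_with_shutdown(a, b):
    """|a - b| with only one subtracter active per sample: the comparator
    result acts as the shut-down signal for the unused subtracter, whose
    inputs would be latched / clock-gated in hardware so it does not switch."""
    if a >= b:      # guard enables only the a - b subtracter
        return a - b
    else:           # guard enables only the b - a subtracter
        return b - a
```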
Loop Scheduling [figures: sequential execution, partial loop unrolling, loop folding]
Reduce the execution delay of a loop: pipeline the operations inside the loop and overlap the execution of operations across iterations; this requires a prologue and an epilogue. Use pipeline scheduling for the loop graph model.
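A behavioural sketch of loop folding with an explicit prologue and epilogue (the two stages are placeholder halves of a loop body; in hardware the two statements in the steady-state body execute concurrently):

```python
def folded_loop(xs):
    """Two-stage loop body overlapped across iterations: stage2 of iteration i
    runs alongside stage1 of iteration i+1 in the steady state."""
    stage1 = lambda x: x * 2          # first half of the original loop body
    stage2 = lambda y: y + 1          # second half of the original loop body

    out = []
    pending = stage1(xs[0])           # prologue: fill the pipeline
    for x in xs[1:]:                  # steady state: both stages busy
        out.append(stage2(pending))
        pending = stage1(x)
    out.append(stage2(pending))       # epilogue: drain the pipeline
    return out

print(folded_loop([1, 2, 3]))  # [3, 5, 7], same result as the sequential loop
```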
DFG Restructuring [figures: DFG2, and DFG2 after redundant operation insertion]
Minimizing the bit transitions for constants during Scheduling
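One reading of this idea: when several constant operands share a functional-unit input or bus across control steps, ordering them so that consecutive constants differ in few bits reduces switching on that input. A greedy ordering sketch (an assumption for illustration, not necessarily the slides' exact algorithm):

```python
def hamming(a, b):
    """Number of differing bits between two constants."""
    return bin(a ^ b).count("1")

def order_constants(constants):
    """Greedy nearest-neighbour ordering so that consecutive constants on the
    shared operand bus cause few bit transitions."""
    remaining = list(constants)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda c: hamming(order[-1], c))
        remaining.remove(nxt)
        order.append(nxt)
    return order

consts = [0b1010, 0b0101, 0b1011, 0b0111]
print(order_constants(consts))  # [10, 11, 7, 5]: 5 bit flips vs 9 in the given order
```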
Control Synthesis: synthesize a circuit that executes the scheduled operations and provides synchronization; it supports iteration, branching, hierarchy, and interfaces.
Allocation ◆ Bind a resource to more than one operation.
Optimum binding
Example