Low Power High Level Synthesis
Professor Jun-Dong Cho, SungKyunKwan University
System Partitioning
System partitioning decides which components of the system are realized in hardware and which are implemented in software. High-quality partitioning is critical in high-level synthesis, and to be useful, high-level synthesis algorithms must be able to handle very large systems. Typically, designers manually partition high-level design specifications into procedures, each of which is then synthesized individually. Different partitionings of the same specification can produce substantial differences in the resulting IC chip area and overall system performance. Partitioning also decides whether the system functions are distributed: distributed processors, memories and controllers can yield significant power savings, at the cost of increased area (e.g., compare a non-distributed and a distributed design of a vector quantizer).
Circuit Partitioning: graph and physical representation
VHDL Example: process communication, control/data flow graph, behavioral description
Clustering Example: two-cluster partition and three-cluster partition
Clustering (Cont'd)
Multilevel Kernighan-Lin
Node weights can be taken into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes i and j. Edge weights can similarly be taken into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) that matches j to i in the construction of Nc to have the largest weight of all edges incident on i; this tends to minimize the weights of the cut edges. This is called heavy edge matching in METIS.
Multilevel Kernighan-Lin (Cont'd)
Given a partition (Nc+, Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+, N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.
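As a concrete illustration of heavy edge matching, here is a minimal Python sketch of one coarsening pass; the graph layout ({node: {neighbor: weight}}, stored symmetrically, with comparable node ids) and all names are assumptions of this sketch, not the METIS interface.

# One heavy-edge-matching coarsening pass (illustrative; not the METIS API).
# graph: {node: {neighbor: edge_weight}}, stored symmetrically.
# node_wt: {node: node_weight}.
def coarsen(graph, node_wt):
    matched = {}       # fine node -> coarse node id
    coarse_wt = {}     # coarse node id -> summed node weight
    for u in graph:
        if u in matched:
            continue
        free = [v for v in graph[u] if v not in matched and v != u]
        # Heavy edge matching: match u across its heaviest unmatched edge.
        v = max(free, key=lambda n: graph[u][n]) if free else u
        cid = len(coarse_wt)
        matched[u] = matched[v] = cid
        coarse_wt[cid] = node_wt[u] + (node_wt[v] if v != u else 0)
    coarse_edges = {}  # (cu, cv) -> summed weight of the collapsed fine edges
    for u in graph:
        for v, w in graph[u].items():
            if u < v:  # each undirected edge is stored twice; count it once
                cu, cv = matched[u], matched[v]
                if cu != cv:
                    key = (min(cu, cv), max(cu, cv))
                    coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_wt, coarse_edges, matched

Matching across the heaviest incident edge pulls large weights inside coarse nodes, so the edges that remain between coarse nodes (the potential cut edges) tend to be light.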
High-Level Synthesis Steps
High Level Synthesis
[Overview diagram] A behavioral description of the circuit (instructions, operations, variables, arrays, signals), together with constraints, is synthesized into an RTL (register transfer level) architecture consisting of control, datapath and memory (operators, registers, memory, multiplexers). The main synthesis tasks are control scheduling, memory inferencing, register sharing and control inferencing.
High-Level Synthesis
The allocation task determines the type and quantity of resources used in the RTL design; it also determines the clocking scheme, memory hierarchy and pipelining style. To perform the required trade-offs, the allocation task must determine the exact area and performance values. The scheduling task schedules operations and memory references into clock cycles; if the number of clock cycles is a constraint, the scheduler has to produce a design with the fewest functional units. The binding task assigns operations and memory references within each clock cycle to available hardware units. A resource can be shared by different operations if they are mutually exclusive, i.e., they will never execute simultaneously.
Example of the High-Level Synthesis Process
Low Power Scheduling
Low-Power Techniques Proposed at the High Level
- Operator sharing among sibling operations [Fang, 96]
- Resource sharing that accounts for data correlation [Gebotys, 97]
- Shutting down FUs (demand-driven operation) [Alidina, 94]
- Exploiting the regularity of operations [Rabaey, 96]
- Using dual supply voltages [Sarrafzadeh, 96]
- Minimizing spurious operations [Hwang, 96]
- Minimizing switching activity with a minimum-cost flow algorithm, plus minimizing capacitance by simplifying the interconnect structure [Cho, 97]
Power Consumption Model of a Register
Power(Register) = switching(x) · (C_out,Mux + C_in,Register) + switching(y) · (C_out,Register + C_in,DeMux)
Since switching(x) = switching(y), Power(Register) = switching(y) · C_total.
CDFG (Control Data Flow Graph)
[CDFG representation of the circuit] The behavior
e = a + b; g = c + d; f = e + b; h = f * g;
is drawn as a graph whose nodes are the +/* operations and whose edges carry the values a, b, c, d, e, f, g, h.
Schematic to CDFG of FIR3
Determining the Number of Registers and Resources
[Lifetime chart for the variables a, b, c, d, e, f, g, h]
High-Level Power Estimation
P_core = P_DP + P_MEM + P_CNTR + P_PROC
P_DP = P_REG + P_MUX + P_FU + P_INT, where
- P_REG is the power of the registers,
- P_MUX is the power of the multiplexers,
- P_FU is the power of the functional units,
- P_INT is the power of the physical interconnect capacitance.
High-Level Power Estimation: P_MUX and P_FU
High-Level Power Estimation: P_REG
- Compute the lifetimes of all the variables in the given VHDL code. Represent the lifetime of each variable as a vertical line from statement i through statement i + n, in the column j reserved for the corresponding variable v_j.
- Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
- Estimate the minimal number N of register sets necessary to implement the code using register sharing. Register sharing is applied whenever a group of variables with the same bit-width b_i have non-overlapping lifetimes.
- Select a possible mapping of variables onto registers using register sharing.
- Compute the number w_i of writes to the variables mapped to the same set of registers. Estimate the activity n_i of each register set by dividing w_i by the number of statements S: n_i = w_i / S; hence TR_i,max = n_i · f_clk.
- Latches and flip-flops consume power not only during output transitions, but also at every clock edge, in their internal clock buffers. This non-switching power P_NSK dissipated by the internal clock buffers accounts for about 30% of the average power in a 0.38-micron, 3.3 V technology. The total register power is the sum of the switching power and P_NSK.
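The lifetime-overlap and toggle-rate steps above are mechanical enough to sketch in a few lines of Python. This is a simplified illustration that ignores the per-bit-width grouping; the input layout (lifetimes as (first, last) statement indices) and all names are assumptions of the sketch.

# Estimate register count and toggle rate from variable lifetimes.
# lifetimes: {var: (first_stmt, last_stmt)}; writes: {var: number of writes};
# S: number of statements; f_clk: clock frequency.
def estimate_registers(lifetimes, writes, S, f_clk):
    last = max(end for _, end in lifetimes.values())
    # Max overlap N: sweep a horizontal cut-line across all statements and
    # count the vertical lifetime lines it intersects.
    N = max(sum(1 for s, e in lifetimes.values() if s <= cut <= e)
            for cut in range(1, last + 1))
    # n = (total writes)/S approximates writes per statement; TR_max = n * f_clk.
    n = sum(writes.values()) / S
    return N, n * f_clk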
P_CNTR
After scheduling, the control is defined and optimized by the hardware mapper, and further by the logic synthesis process, before mapping to layout. Like interconnect, therefore, the control needs to be estimated statistically.
Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller. Here N_trans is the number of transitions, N_states is the number of states, C_lc is the capacitance switched in any local controller in one sample period, and B_f is the ratio of the number of bus accesses to the number of busses.
Global control model.
N_trans
The number of transitions depends on assignment, scheduling, optimizations, logic optimization, the standard cell library used, the amount of glitching, and the statistics of the inputs.
Factors of the Coarse-Grained Model (obtained by a switch-level simulator)
Low Power Scheduling and Binding
(a) Scheduling without low-power consideration; (b) low-power-aware scheduling (modules M1, M2).
How Much Power Reduction?
The coarse-grained model provides a fast estimate of the power consumption when no information about the activity of the input data to the functional units is available.
Fine-Grained Model
Used when information about the activity of the input data to the functional units is available.
Effect of operand activity on the power consumption of an 8 x 8-bit Booth multiplier.
[Plot: power consumption vs. AHD of the input data]
Loop Interchange
If matrix A is laid out in memory in column-major form, execution order (a.2) implies more cache misses than the execution order in (b.2). Thus, the compiler chooses algorithm (b.1) to reduce the running time.
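A small Python illustration of the two execution orders, using numpy's order='F' (Fortran, i.e. column-major) layout as the slide assumes; in compiled code this access-pattern difference is exactly what the compiler's loop interchange exploits.

import numpy as np

A = np.asfortranarray(np.ones((512, 512)))  # column-major storage

def sum_row_order(A):    # (a.2): inner loop scans a row -> large stride in 'F' layout
    total, (rows, cols) = 0.0, A.shape
    for i in range(rows):
        for j in range(cols):
            total += A[i, j]
    return total

def sum_col_order(A):    # (b.2): inner loop scans a column -> unit stride in 'F' layout
    total, (rows, cols) = 0.0, A.shape
    for j in range(cols):
        for i in range(rows):
            total += A[i, j]
    return total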
Motion Estimation
Motion Estimation (low power)
Matrix-vector product algorithm
Retiming
Flip-flop insertion to minimize hazard activity: moving a flip-flop in a circuit.
Exploiting Spatial Locality for Interconnect Power Reduction
A spatially local cluster is a group of algorithm operations that are tightly connected to each other in the flowgraph representation. Two nodes are tightly connected in the flowgraph representation if the shortest distance between them, in terms of the number of edges traversed, is low. A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware. Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (over local busses) and relatively few occur between clusters (over global busses). The partitioning information is passed to the architecture netlist and floorplanning tools.
Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to another adder's inputs.
Hardware Mapping
The last step in the synthesis process maps the allocated, assigned and scheduled flow graph (called the decorated flow graph) onto the available hardware blocks. The result of this process is a structural description of the processor architecture (e.g., sdl input to the Lager IV silicon assembly environment). The mapping process transforms the flow graph into three structural sub-graphs:
- the data path structure graph,
- the controller state machine graph,
- the interface graph (between data path control inputs and the controller output signals).
Spectral Partitioning in High-Level Synthesis
The eigenvector placement obtained forms an ordering in which nodes tightly connected to each other are placed close together; the relative distances are a measure of the tightness of the connections. The eigenvector ordering is used to generate several partitioning solutions. The area estimates are based on distribution graphs; a distribution graph displays the expected number of operations executed in each time slot. Local bus power: the number of local data transfers times the area of the cluster. Global bus power: the number of global data transfers times the total area.
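A minimal sketch of the eigenvector ordering using the second-smallest eigenvector (the Fiedler vector) of the graph Laplacian; the matrix input and function names are assumptions of this sketch, and the cost evaluation of the candidate cuts is left out.

import numpy as np

def spectral_order(adj):              # adj: symmetric edge-weight matrix
    lap = np.diag(adj.sum(axis=1)) - adj   # graph Laplacian L = D - A
    _, vecs = np.linalg.eigh(lap)          # eigenvalues in ascending order
    return np.argsort(vecs[:, 1])          # sort nodes by the Fiedler vector

def candidate_partitions(adj):
    order = spectral_order(adj)
    # Every cut point of the 1-D ordering yields one two-way partition
    # whose local/global bus power can then be estimated.
    return [(set(order[:k].tolist()), set(order[k:].tolist()))
            for k in range(1, len(order))]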
Finding a Good Partition
Interconnection Estimation
For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about …% of the datapath height. The average global bus length is estimated as the square root of the estimated chip area. The three terms of the area model represent white space, active area of the components, and wiring area; the coefficients are derived statistically.
Datapath Generation
Register file recognition and multiplexer reduction:
- Individual registers are merged as much as possible into register files.
- This reduces the number of bus multiplexers, the overall number of busses (since all registers in a file share the input and output busses), and the number of control signals (since a register file uses a local decoder).
Minimize the multiplexer and I/O bus cost simultaneously (clique partitioning is NP-complete, so simulated annealing is used). Data path partitioning optimizes the processor floorplan. The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.
Incorporating into HYPER-LP
Exploiting spatial locality for interconnect power reduction
[Figure: global vs. local transfers between Adder1 and Adder2]
Experiments
Balancing Maximal Time-Sharing and Fully-Parallel Implementation
A fourth-order parallel-form IIR filter: (a) local assignment (2 global transfers); (b) non-local assignment (20 global transfers).
Retiming/Pipelining for the Critical Path
Effective Resource Utilization
Hazard Propagation Elimination by Clocked Sampling
By sampling the steady-state signal at a register input, glitches are no longer propagated into the following combinational logic.
Regularity
Common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers and busses). Regular designs often have less control hardware.
Module Selection
Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that a minimal hardware cost is obtained under the given timing and throughput constraints. Full pipelining may not be effective when the clock period mismatches the execution times of the operators; performing operations in sequence without intermediate buffering (chaining) can then reduce the critical path. Clustering is used to map operations onto non-pipelined hardware modules such that the reusability of these modules over the complete computational graph is maximized. During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.
Estimating the Number of Resources
Estimate min and max bounds on the required resources to:
- delimit the design space (min bounds serve as an initial solution),
- serve as entries in a resource utilization table which guides the transformation, assignment and scheduling operations.
The max bound on execution time is t_max, obtained by topological ordering of the DFG using ASAP and ALAP. The minimum bound on the number of resources for each resource class is N_Ri = ceil(d_Ri · O_Ri / t_max), where N_Ri is the number of resources of class R_i, d_Ri is the duration of a single operation, and O_Ri is the number of operations of that class.
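The min bound is a one-line computation; a tiny sketch (names illustrative):

from math import ceil

def min_resources(d_Ri, O_Ri, t_max):
    # At least ceil(d_Ri * O_Ri / t_max) units of class Ri are needed to fit
    # O_Ri operations of duration d_Ri into t_max time steps.
    return ceil(d_Ri * O_Ri / t_max)

assert min_resources(2, 6, 8) == 2   # e.g. six 2-cycle multiplies in 8 cycles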
Exploring the Design Space
Find the minimal-area solution that satisfies the timing constraints. By checking the critical paths, determine whether the proposed graph violates the timing constraints; if so, retiming, pipelining and tree-height reduction can be applied. After an acceptable graph is obtained, the resource allocation process is initiated:
- change the available hardware (FUs, registers, busses),
- redistribute the time allocation over the sub-graphs,
- transform the graph to reduce the hardware requirements.
Use a rejectionless probabilistic iterative search technique (a variant of simulated annealing) where moves are always accepted. This approach reduces computational complexity and gives faster convergence.
Datapath Synthesis
After module selection we have:
Scheduling and Binding
The scheduling task selects the control step in which a given operation will happen, i.e., it assigns each operation to an execution cycle.
Sharing: bind a resource to more than one operation; the operations must not execute concurrently. The graph is scheduled hierarchically in a bottom-up fashion.
Power tradeoffs:
- Shorter schedules enable supply voltage (Vdd) scaling.
- The schedule directly impacts resource sharing.
- Energy consumption depends on what the previous instruction was.
- Reordering can minimize the switching on the control path.
Clock selection: eliminate slacks; choose the optimal system clock period.
ASAP Scheduling Algorithm: HAL Example
ALAP Scheduling Algorithm: HAL Example
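For reference, a minimal Python rendering of ASAP and ALAP over a DFG given as {op: [predecessors]} with unit-delay operations; this is an illustrative sketch, not the HAL formulation, and real schedulers also track operation delays and chaining.

def asap(preds):
    step = {}
    def s(v):                      # earliest step: one past the latest predecessor
        if v not in step:
            step[v] = 1 + max((s(p) for p in preds[v]), default=0)
        return step[v]
    for v in preds:
        s(v)
    return step

def alap(preds, latency):
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    step = {}
    def s(v):                      # latest step: one before the earliest successor
        if v not in step:
            step[v] = min((s(u) for u in succs[v]), default=latency + 1) - 1
        return step[v]
    for v in preds:
        s(v)
    return step

# Mobility = alap - asap per operation; zero mobility marks the critical path.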
Force-Directed Scheduling (Latency-Constrained Minimum Resource)
Force is used as the priority function and is related to concurrency; operations are scheduled in order of least force. Mechanical analogy: force = constant x displacement, where the constant is the operation-type distribution and the displacement is the change in probability.
Force Directed Scheduling
Example: Operation v6
Force-Directed Scheduling Algorithm (Paulin)
Force-Directed Scheduling Example
[Figures: probability of scheduling operations into control steps, before and after operation o3 is scheduled to step s2; operator cost for multiplications in (a) and in (c).]
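The probabilities and operator costs in the example can be reproduced with a short sketch: each operation contributes 1/(window size) to every step of its mobility window, the per-type sums form the distribution graph, and the self force of fixing an operation at step t weighs the probability change by the distribution. The data layout and names are assumptions of this sketch.

def distribution_graph(ops, steps):
    # ops: {name: (asap, alap, op_type)}; steps: iterable of control steps.
    dg = {ty: {s: 0.0 for s in steps} for _, _, ty in ops.values()}
    for a, l, ty in ops.values():
        p = 1.0 / (l - a + 1)           # uniform probability over the window
        for s in range(a, l + 1):
            dg[ty][s] += p
    return dg

def self_force(op, t, dg):
    a, l, ty = op                       # fixing op at step t within [a, l]
    p = 1.0 / (l - a + 1)
    return sum(dg[ty][s] * ((1.0 if s == t else 0.0) - p)
               for s in range(a, l + 1))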
List Scheduling (Resource-Constrained Minimum Latency)
[Figures: the DFG with mobility labeling (inside <>), the ready operation list under the resource constraint, and the scheduled DFG.]
Static-List Scheduling
[Figures: the DFG, the priority list, a partial schedule of five nodes, and the final schedule.]
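Both variants follow the same skeleton; below is a minimal resource-constrained list scheduler with unit-delay operations (the data layout and names are assumptions of this sketch):

def list_schedule(preds, op_type, units, mobility):
    # preds: {op: [predecessor ops]}; units: {type: available unit count};
    # mobility: priority key (least mobility = most urgent).
    remaining, done, schedule, cycle = set(preds), set(), {}, 0
    while remaining:
        cycle += 1
        ready = [v for v in remaining if all(p in done for p in preds[v])]
        ready.sort(key=lambda v: mobility[v])
        used = {ty: 0 for ty in units}
        fired = []
        for v in ready:                   # fill the units in priority order
            ty = op_type[v]
            if used[ty] < units[ty]:
                used[ty] += 1
                schedule[v] = cycle
                fired.append(v)
        remaining -= set(fired)
        done |= set(fired)
    return schedule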
Divide-and-Conquer to Minimize Power Consumption
- Decompose the computation into strongly connected components (SCCs);
- Merge any adjacent trivial SCCs into a sub-part;
- Use pipelining to isolate the sub-parts;
- For each sub-part:
  - minimize the number of delays using retiming;
  - if the sub-part is linear, apply optimal unfolding; else apply unfolding after isolating the nonlinear operations;
- Merge linear sub-parts to optimize further;
- Schedule the merged sub-parts to minimize memory usage.
SCC Decomposition Step
Use the standard depth-first-search-based algorithm [Tarjan, 1972], which has low-order polynomial-time complexity. For any pair of operations A and B within an SCC, there exists both a path from A to B and a path from B to A. The graph formed by the SCCs is acyclic; thus, the SCCs can be isolated from each other using pipeline delays, which enables us to optimize each SCC separately.
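Tarjan's algorithm is compact enough to sketch directly; a recursive Python version over {node: [successors]} (illustrative; the recursion should be made iterative for large graphs):

def tarjan_scc(succ):
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]
    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v roots an SCC: pop it off the stack
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            sccs.append(comp)
    for v in succ:
        if v not in index:
            visit(v)
    return sccs                           # each component is optimized separately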
Choosing Optimal Clock Period
Supply Voltage Scaling
Lowering Vdd reduces energy but increases delay.
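A first-order feel for the trade-off (illustrative constants, not measured data): dynamic energy scales as Vdd^2 while gate delay grows roughly as Vdd / (Vdd - Vt)^2, so lowering Vdd from 5 V to 3.3 V cuts energy by more than half at the cost of a roughly 2x delay increase.

def energy_norm(vdd, vref=5.0):
    return (vdd / vref) ** 2              # dynamic energy ~ C * Vdd^2

def delay_norm(vdd, vt=0.8, vref=5.0):
    d = lambda v: v / (v - vt) ** 2       # first-order gate-delay model
    return d(vdd) / d(vref)

for vdd in (5.0, 3.3, 2.5, 1.5):
    print(f"Vdd={vdd:.1f} V  energy x{energy_norm(vdd):.2f}  delay x{delay_norm(vdd):.2f}")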
Multiple Supply Voltages: Filter Example
Scheduling Using Shut-Down: |a-b|
Loop Scheduling
[Figures: sequential execution, partial loop unrolling, and loop folding.]
Loop Folding
- Reduce the execution delay of a loop.
- Pipeline operations inside the loop, overlapping the execution of operations.
- A prologue and an epilogue are needed.
- Use pipeline scheduling on the loop graph model.
DFG Restructuring: DFG2, and DFG2 after redundant operation insertion
Minimizing the Bit Transitions for Constants During Scheduling
Control Synthesis
Synthesize a circuit that:
- executes the scheduled operations,
- provides synchronization,
- supports iteration, branching, hierarchy and interfaces.
Allocation
Bind a resource to more than one operation; each resource is identified by (type, id).
Optimum Binding: the compatibility graph
Coloring on the Conflict Graph
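As a sketch of the idea this slide names: variables (or operations) whose lifetimes overlap conflict and must be bound to different registers, which is graph coloring on the conflict graph. A greedy coloring in Python (heuristic and illustrative; optimum coloring is NP-hard in general):

def greedy_color(conflict):
    # conflict: {var: set of vars whose lifetimes overlap with it}
    order = sorted(conflict, key=lambda v: len(conflict[v]), reverse=True)
    color = {}
    for v in order:                      # highest-degree variables first
        taken = {color[u] for u in conflict[v] if u in color}
        color[v] = next(c for c in range(len(conflict)) if c not in taken)
    return color                         # register index per variable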
Resource Sharing
- Parallel vs. time-shared busses (or execution units).
- Resource sharing can destroy signal correlations and increase switching activity, so it should be done between operations that are strongly connected.
- Map operations with correlated input signals to the same units.
- Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify interconnect (busses, multiplexers, buffers).