VADA Lab.SungKyunKwan Univ. 1 Lower Power High Level Synthesis, August 1999. Prof. Jun-Dong Cho, SungKyunKwan University


VADA Lab.SungKyunKwan Univ. 2 System Partitioning
– Decide which components of the system will be realized in hardware and which will be implemented in software. High-quality partitioning is critical in high-level synthesis.
– To be useful, high-level synthesis algorithms should be able to handle very large systems. Typically, designers manually partition high-level design specifications into procedures, each of which is then synthesized individually. Different partitionings of the high-level specification may produce substantial differences in the resulting IC chip area and overall system performance.
– Decide whether the system functions are distributed or not. Distributed processors, memories and controllers can lead to significant power savings; the drawback is an increase in area. E.g., compare a non-distributed and a distributed design of a vector quantizer.

VADA Lab.SungKyunKwan Univ. 3 Circuit Partitioning: graph and physical representation

VADA Lab.SungKyunKwan Univ. 4 VHDL example: behavioral description, process communication, and control/data flow graph

VADA Lab.SungKyunKwan Univ. 5 Clustering Example: two-cluster partition vs. three-cluster partition

VADA Lab.SungKyunKwan Univ. 6 Clustering (Cont’d)

VADA Lab.SungKyunKwan Univ. 7 Multilevel Kernighan-Lin
Note that we can take node weights into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this tends to minimize the weights of the cut edges. This is called heavy-edge matching in METIS, and is illustrated on the right.
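The coarsening step can be made concrete with a short sketch. Below is a minimal Python illustration of heavy-edge matching followed by graph coarsening, assuming an undirected graph stored as a symmetric dict of dicts (node -> {neighbor: edge weight}) with comparable node labels; the function names are ours, not METIS's.

def heavy_edge_matching(graph):
    # graph: {node: {neighbor: weight}}, symmetric. Returns node -> partner.
    matched = {}
    for u in graph:
        if u in matched:
            continue
        best, best_w = None, -1
        for v, w in graph[u].items():
            if v != u and v not in matched and w > best_w:
                best, best_w = v, w
        if best is None:
            matched[u] = u          # no free neighbor: node stays a singleton
        else:
            matched[u], matched[best] = best, u
    return matched

def coarsen(graph, matched):
    # Collapse each matched pair into one coarse node; the weights of edges
    # "collapsed" between the same pair of coarse nodes are summed.
    rep = {u: min(u, matched[u]) for u in graph}
    coarse = {rep[u]: {} for u in graph}
    for u in graph:
        for v, w in graph[u].items():
            cu, cv = rep[u], rep[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse

Matching each node to its heaviest free neighbor hides the heaviest edges inside coarse nodes, so they can never become cut edges at the coarse level; that is exactly the rationale the slide gives for heavy-edge matching.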

VADA Lab.SungKyunKwan Univ. 8 Multilevel Kernighan-Lin
Given a partition (Nc+, Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+, N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. This is again shown below. Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.

VADA Lab.SungKyunKwan Univ. 9 High-Level Synthesis Steps

VADA Lab.SungKyunKwan Univ. 10 High-Level Synthesis
(Diagram: a behavioral description of the circuit (instructions, operations, variables, arrays, signals) is mapped, under constraints, to an RTL (register transfer level) architecture (control, datapath, memory: operators, registers, memories, multiplexers) through scheduling, memory inferencing, register sharing, and control inferencing.)

VADA Lab.SungKyunKwan Univ. 11 High-Level Synthesis
– The allocation task determines the type and quantity of resources used in the RTL design. It also determines the clocking scheme, memory hierarchy and pipelining style. To perform the required trade-offs, the allocation task must determine the exact area and performance values.
– The scheduling task schedules operations and memory references into clock cycles. If the number of clock cycles is a constraint, the scheduler has to produce a design with the fewest functional units.
– The binding task assigns operations and memory references within each clock cycle to available hardware units. A resource can be shared by different operations if they are mutually exclusive, i.e., they will never execute simultaneously.

VADA Lab.SungKyunKwan Univ. 12 Example of the High-Level Synthesis Process

VADA Lab.SungKyunKwan Univ. 13 Low Power Scheduling

VADA Lab.SungKyunKwan Univ. 14 Low-power techniques proposed at the high level
– Operator sharing among sibling operations [Fang, 96]
– Resource sharing that accounts for data correlation [Gebotys, 97]
– Shutting down functional units (demand-driven operation) [Alidina, 94]
– Exploiting the regularity of operations [Rabaey, 96]
– Using dual supply voltages [Sarrafzadeh, 96]
– Minimizing spurious operations [Hwang, 96]
– Minimizing switching activity with a minimum-cost flow algorithm, plus minimizing capacitance by simplifying the interconnect structure [Cho, 97]

VADA Lab.SungKyunKwan Univ. 15 Register power model
Power(Register) = sw(x) · (C_out,Mux + C_in,Register) + sw(y) · (C_out,Register + C_in,DeMux).
Since sw(x) = sw(y), Power(Register) = sw(y) · C_total.
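As a sanity check, the model translates directly into code. A minimal sketch, assuming the usual C·V²·f dynamic-power form (the V_dd² · f_clk factor is left implicit on the slide; the parameter values below are illustrative):

def register_power(sw_y, c_out_mux, c_in_reg, c_out_reg, c_in_demux,
                   vdd=3.3, f_clk=100e6):
    # sw_y: switching activity at the register output; capacitances in farads.
    c_total = c_out_mux + c_in_reg + c_out_reg + c_in_demux
    return sw_y * c_total * vdd ** 2 * f_clk   # watts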

VADA Lab.SungKyunKwan Univ. 16 CDFG (control/data flow graph)
CDFG representation of the circuit computing: e = a + b; g = c + d; f = e + b; h = f * g;
(Figure: DFG with inputs a-d, adders producing e, g, f, and a multiplier producing h.)
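In a synthesis tool this CDFG is just a small graph. A sketch of one possible in-memory representation (names follow the slide):

# One operation per defined variable: name -> (operator, operands).
ops = {
    'e': ('+', ['a', 'b']),
    'g': ('+', ['c', 'd']),
    'f': ('+', ['e', 'b']),
    'h': ('*', ['f', 'g']),
}
# Data-flow edges run from each value to the operation that consumes it.
edges = [(src, dst) for dst, (_, srcs) in ops.items() for src in srcs]
print(edges)   # [('a', 'e'), ('b', 'e'), ('c', 'g'), ('d', 'g'), ('e', 'f'), ...]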

VADA Lab.SungKyunKwan Univ. 17 Schematic to CDFG of FIR3

VADA Lab.SungKyunKwan Univ. 18 Determining the number of registers and resources
(Figure: lifetime chart for variables a-h.)

VADA Lab.SungKyunKwan Univ. 19 High-Level Power Estimation
P_core = P_DP + P_MEM + P_CNTR + P_PROC
P_DP = P_REG + P_MUX + P_FU + P_INT, where
– P_REG is the power of the registers
– P_MUX is the power of the multiplexers
– P_FU is the power of the functional units
– P_INT is the power of the physical interconnect capacitance

VADA Lab.SungKyunKwan Univ. 20 High-Level Power Estimation: P_MUX and P_FU

VADA Lab.SungKyunKwan Univ. 21 High-Level Power Estimation: P_REG
– Compute the lifetimes of all the variables in the given VHDL code. Represent the lifetime of each variable as a vertical line from statement i through statement i + n in the column j reserved for the corresponding variable v_j.
– Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
– Estimate the minimal number N of register sets necessary to implement the code by using register sharing; register sharing is applied to groups of variables with the same bit-width b_i.
– Select a possible mapping of variables into registers by using register sharing.
– Compute the number w_i of writes to the variables mapped to the same set of registers. Estimate the access rate n_i of each register set by dividing w_i by the number of statements S: n_i = w_i / S; hence TR_i,max = n_i · f_clk.
– Power in latches and flip-flops is consumed not only during output transitions, but also during all clock edges by the internal clock buffers. The non-switching power P_NSK dissipated by the internal clock buffers accounts for 30% of the average power for the 0.38-micron, 3.3 V process considered. In total, P_REG is the sum of these contributions.
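A compact sketch of the lifetime-based part of this recipe, under the simplifying assumption of one uniform bit-width; the variable names and numbers below are made up for illustration:

def estimate_registers(lifetimes, writes, n_statements, f_clk):
    # lifetimes: var -> (first_statement, last_statement); writes: var -> w_i.
    events = []
    for start, end in lifetimes.values():
        events.append((start, 1))
        events.append((end + 1, -1))    # a lifetime ends after its last use
    events.sort()
    n = overlap = 0
    for _, delta in events:
        overlap += delta
        n = max(n, overlap)             # N: max lifetimes cut by one line
    # n_i = w_i / S, hence TR_i,max = n_i * f_clk for each register set
    rates = {v: (w / n_statements) * f_clk for v, w in writes.items()}
    return n, rates

lifetimes = {'a': (1, 3), 'b': (1, 4), 'e': (2, 4)}
writes = {'a': 1, 'b': 1, 'e': 2}
print(estimate_registers(lifetimes, writes, n_statements=5, f_clk=100e6))
# -> (3, {'a': 2e7, 'b': 2e7, 'e': 4e7}): 3 registers, per-set toggle rates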

VADA Lab.SungKyunKwan Univ. 22 P_CNTR
After scheduling, the control is defined; it is optimized by the hardware mapper and further by the logic synthesis process before mapping to layout. Like interconnect, therefore, the control needs to be estimated statistically.
Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller. Here N_trans is the number of transitions, N_states is the number of states, C_lc is the capacitance switched in any local controller in one sample period, and B_f is the ratio of the number of bus accesses to the number of busses.
Global control model.

VADA Lab.SungKyunKwan Univ. 23 N_trans
The number of transitions depends on the assignment, scheduling and other optimizations, the logic optimization, the standard cell library used, the amount of glitching, and the statistics of the inputs.

VADA Lab.SungKyunKwan Univ. 24 Factors of the coarse-grained model (obtained by a switch-level simulator)

VADA Lab.SungKyunKwan Univ. 25 Low Power Scheduling and Binding
(a) Scheduling without low-power considerations; (b) low-power-aware scheduling. (Figure labels: M1, M2.)

VADA Lab.SungKyunKwan Univ. 26 How much power reduction?
The coarse-grained model provides a fast estimate of the power consumption when no information about the activity of the input data to the functional units is available.

VADA Lab.SungKyunKwan Univ. 27 Fine-grained model
Used when information about the activity of the input data to the functional units is available.

VADA Lab.SungKyunKwan Univ. 28 Effect of the operand activity on the power consumption of an 8 × 8-bit Booth multiplier (figure: power vs. AHD of the input data).

VADA Lab.SungKyunKwan Univ. 29 Loop Interchange
If matrix A is laid out in memory in column-major form, execution order (a.2) implies more cache misses than execution order (b.2). Thus, the compiler chooses algorithm (b.1) to reduce the running time.
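The transformation itself is tiny. A Python/NumPy sketch (in interpreted Python the cache effect is muted, but the access-order difference is the point):

import numpy as np

A = np.asfortranarray(np.random.rand(512, 512))   # column-major layout

def row_first(A):      # order (a): strides across columns, poor locality here
    s = 0.0
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            s += A[i, j]
    return s

def column_first(A):   # order (b): walks memory in storage order
    s = 0.0
    for j in range(A.shape[1]):
        for i in range(A.shape[0]):
            s += A[i, j]
    return s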

VADA Lab.SungKyunKwan Univ. 30 Motion Estimation

VADA Lab.SungKyunKwan Univ. 31 Motion Estimation (low power)

VADA Lab.SungKyunKwan Univ. 32 Matrix-vector product algorithm

VADA Lab.SungKyunKwan Univ. 33 Retiming
Flip-flop insertion to minimize hazard activity by moving a flip-flop in a circuit.

VADA Lab.SungKyunKwan Univ. 34 Exploiting spatial locality for interconnect power reduction
– A spatially local cluster is a group of algorithm operations that are tightly connected to each other in the flowgraph representation. Two nodes are tightly connected on the flowgraph representation if the shortest distance between them, in terms of the number of edges traversed, is low.
– A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware.
– Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (over local busses) and relatively few occur between clusters (over global busses). The partitioning information is passed to the architecture netlist and floorplanning tools.
– Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to another adder's inputs.

VADA Lab.SungKyunKwan Univ. 35 Hardware Mapping The last step in the synthesis process maps the allocated, assigned and scheduled flow graph (called the decorated flow graph) onto the available hardware blocks. The result of this process is a structural description of the processor architecture, (e.g., sdl input to the Lager IV silicon assembly environment). The mapping process transforms the flow graph into three structural sub-graphs: the data path structure graph the controller state machine graph the interface graph (between data path control inputs and the controller output signals)

VADA Lab.SungKyunKwan Univ. 36 Spectral Partitioning in High-Level Synthesis
– The eigenvector placement obtained forms an ordering in which nodes tightly connected to each other are placed close together; the relative distances are a measure of the tightness of the connections. Use the eigenvector ordering to generate several partitioning solutions.
– The area estimates are based on distribution graphs. A distribution graph displays the expected number of operations executed in each time slot.
– Local bus power: the number of local data transfers times the area of the cluster. Global bus power: the number of global data transfers times the total area.

VADA Lab.SungKyunKwan Univ. 37 Finding a good Partition

VADA Lab.SungKyunKwan Univ. 38 Interconnection Estimation
For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about …% of the datapath height. Average global bus length: the square root of the estimated chip area. The three terms represent white space, active area of the components, and wiring area; the coefficients are derived statistically.

VADA Lab.SungKyunKwan Univ. 39 Datapath Generation
– Register-file recognition and multiplexer reduction: individual registers are merged as much as possible into register files. This reduces the number of bus multiplexers, the overall number of busses (since all registers in a file share the input and output busses), and the number of control signals (since a register file uses a local decoder).
– Minimize the multiplexers and I/O busses simultaneously (clique partitioning is NP-complete, so Simulated Annealing is used).
– Datapath partitioning optimizes the processor floorplan. The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.

VADA Lab.SungKyunKwan Univ. 40 Incorporating into HYPER-LP

VADA Lab.SungKyunKwan Univ. 41 Exploiting spatial locality for interconnect power reduction (figure: global vs. local transfers between Adder1 and Adder2)

VADA Lab.SungKyunKwan Univ. 42 Experiments

VADA Lab.SungKyunKwan Univ. 43 Balancing maximal time-sharing and fully-parallel implementation
A fourth-order parallel-form IIR filter: (a) local assignment (2 global transfers); (b) non-local assignment (20 global transfers).

VADA Lab.SungKyunKwan Univ. 44 Retiming/pipelining for the critical path

VADA Lab.SungKyunKwan Univ. 45 Effective Resource Utilization

VADA Lab.SungKyunKwan Univ. 46 Hazard propagation elimination by clocked sampling
By sampling a steady-state signal at a register input, no further glitches are propagated into the following combinational logic.

VADA Lab.SungKyunKwan Univ. 47 Regularity
Common patterns enable the design of a less complex architecture and therefore a simpler interconnect structure (muxes, buffers, and busses). Regular designs often have less control hardware.

VADA Lab.SungKyunKwan Univ. 48 Module Selection
Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that a minimal hardware cost is obtained under the given timing and throughput constraints. Full pipelining may not be effective: the clock period becomes ineffective when it mismatches the execution times of the operators, and performing operations in sequence without intermediate buffering can result in a reduction of the critical path. Clustering is useful for mapping operations onto non-pipelined hardware modules such that the reusability of these modules over the complete computational graph is maximized. During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.

VADA Lab.SungKyunKwan Univ. 49 Estimation of the number of resources
Estimate min and max bounds on the required resources to:
– delimit the design space (the min bounds serve as an initial solution);
– serve as entries in a resource utilization table which guides the transformation, assignment and scheduling operations.
The max bound on execution time is t_max, obtained from a topological ordering of the DFG using ASAP and ALAP. The minimum bound on the number of resources of each class is N_Ri = ⌈(O_Ri · d_Ri) / t_max⌉, where N_Ri is the number of resources of class R_i, d_Ri is the duration of a single operation, and O_Ri is the number of operations of class R_i.
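Numerically, the bound is a one-liner; the example values below are assumed for illustration:

from math import ceil

def min_resources(n_ops, op_duration, t_max):
    # N_Ri = ceil(O_Ri * d_Ri / t_max)
    return ceil(n_ops * op_duration / t_max)

# e.g. 8 multiplications of 2 cycles each within a 6-cycle budget:
print(min_resources(n_ops=8, op_duration=2, t_max=6))   # -> 3 units at minimum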

VADA Lab.SungKyunKwan Univ. 50 Exploring the Design Space
– Find the minimal-area solution subject to the timing constraints. By checking the critical paths, determine whether the proposed graph violates the timing constraints; if so, retiming, pipelining and tree-height reduction can be applied.
– Once an acceptable graph is obtained, the resource allocation process is initiated: change the available hardware (FUs, registers, busses), redistribute the time allocation over the sub-graphs, or transform the graph to reduce the hardware requirements.
– Use a rejectionless probabilistic iterative search technique (a variant of Simulated Annealing) in which moves are always accepted. This approach reduces computational complexity and gives faster convergence.

VADA Lab.SungKyunKwan Univ. 51 Datapath Synthesis
After module selection we have:

VADA Lab.SungKyunKwan Univ. 52 Scheduling and Binding
The scheduling task selects the control step in which a given operation will happen, i.e., it assigns each operation to an execution cycle.
– Sharing: bind a resource to more than one operation. The operations must not execute concurrently.
– The graph is scheduled hierarchically in a bottom-up fashion.
– Power tradeoffs: shorter schedules enable supply voltage (Vdd) scaling; the schedule directly impacts resource sharing; energy consumption depends on what the previous instruction was; reorder to minimize the switching on the control path.
– Clock selection: eliminate slacks; choose the optimal system clock period.

VADA Lab.SungKyunKwan Univ. 53 ASAP Scheduling Algorithm: HAL Example

VADA Lab.SungKyunKwan Univ. 54 ALAP Scheduling Algorithm: HAL Example
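Both schedules fall out of one graph traversal each. A minimal sketch for unit-delay operations, with the DFG given as {operation: [predecessor operations]}; as input we reuse the four-operation DFG from the CDFG slide above rather than reproducing the HAL benchmark:

def asap(preds):
    step = {}
    def visit(v):
        if v not in step:
            step[v] = 1 + max((visit(p) for p in preds[v]), default=0)
        return step[v]
    for v in preds:
        visit(v)
    return step

def alap(preds, latency):
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    step = {}
    def visit(v):
        if v not in step:
            step[v] = min((visit(s) for s in succs[v]), default=latency + 1) - 1
        return step[v]
    for v in preds:
        visit(v)
    return step

preds = {'e': [], 'g': [], 'f': ['e'], 'h': ['f', 'g']}
print(asap(preds))              # ASAP: e=1, g=1, f=2, h=3
print(alap(preds, latency=3))   # ALAP: e=1, g=2, f=2, h=3

The mobility used by the list and force-directed schedulers below is simply alap(v) - asap(v); here only g has slack.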

VADA Lab.SungKyunKwan Univ. 55 Force-Directed Scheduling (latency-constrained minimum resource)
Force is used as the priority function and is related to concurrency; operations are sorted for least force. Mechanical analogy: force = constant × displacement, where the constant is the operation-type distribution and the displacement is the change in probability. (Figure: distribution graphs q_mult and q_alu.)

VADA Lab.SungKyunKwan Univ. 56 Force Directed Scheduling

VADA Lab.SungKyunKwan Univ. 57 Example: operation v_6

VADA Lab.SungKyunKwan Univ. 58 Force-Directed Scheduling Algorithm (Paulin)

VADA Lab.SungKyunKwan Univ. 59 Force-Directed Scheduling Example
(Figures: probability of scheduling operations into control steps; the same probabilities after operation o_3 is scheduled to step s_2; operator cost for multiplications in (a) and in (c).)
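The example's numbers come from two easily computed quantities. A minimal sketch with illustrative windows (not the HAL values): uniform scheduling probabilities over each operation's [ASAP, ALAP] window, the distribution graph they induce, and the self-force of pinning one operation to one step:

def distribution_graph(windows, steps):
    # windows: op -> (asap, alap); probability 1/width spread over the window
    dg = [0.0] * (steps + 1)               # dg[1..steps]
    for lo, hi in windows.values():
        p = 1.0 / (hi - lo + 1)
        for s in range(lo, hi + 1):
            dg[s] += p
    return dg

def self_force(windows, dg, v, t):
    # Force of fixing op v at step t: sum of DG(s) * (new_prob - old_prob)
    lo, hi = windows[v]
    p = 1.0 / (hi - lo + 1)
    return sum(dg[s] * ((1.0 if s == t else 0.0) - p)
               for s in range(lo, hi + 1))

windows = {'m1': (1, 2), 'm2': (1, 3)}     # two multiplications, their windows
dg = distribution_graph(windows, steps=3)
print([round(x, 2) for x in dg[1:]])       # [0.83, 0.83, 0.33]
print(round(self_force(windows, dg, 'm2', 3), 2))   # -0.33: least-force choice

A negative force means the assignment reduces expected concurrency for that operation type, which is why the scheduler picks the least-force (operation, step) pair.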

VADA Lab.SungKyunKwan Univ. 60 List Scheduling (resource-constrained minimum latency)
(Figures: the scheduled DFG; the DFG with mobility labeling (inside <>); the ready-operation list and resource constraint.)
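A minimal resource-constrained list scheduler, assuming unit-delay operations and using mobility (ALAP - ASAP, as computed above) as the priority, lowest mobility first; the tiny DFG is again the one from the CDFG slide:

def list_schedule(preds, op_type, n_units, mobility):
    done, schedule, t = set(), {}, 0
    while len(done) < len(preds):
        t += 1
        ready = sorted((v for v in preds if v not in done
                        and all(p in done for p in preds[v])),
                       key=lambda v: mobility[v])
        used = {}
        for v in ready:                      # fill units in priority order
            k = op_type[v]
            if used.get(k, 0) < n_units[k]:
                schedule[v] = t
                used[k] = used.get(k, 0) + 1
        done.update(v for v, s in schedule.items() if s == t)
    return schedule

preds = {'e': [], 'g': [], 'f': ['e'], 'h': ['f', 'g']}
op_type = {'e': '+', 'g': '+', 'f': '+', 'h': '*'}
mobility = {'e': 0, 'g': 1, 'f': 0, 'h': 0}
print(list_schedule(preds, op_type, {'+': 1, '*': 1}, mobility))
# with one adder and one multiplier: e->1, f->2, g->3, h->4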

VADA Lab.SungKyunKwan Univ. 61 Static-List Scheduling
(Figures: DFG; partial schedule of five nodes; priority list; the final schedule.)

VADA Lab.SungKyunKwan Univ. 62 Divide-and-conquer to minimize the power consumption
– Decompose the computation into strongly connected components (SCCs);
– Merge adjacent trivial SCCs into sub-parts;
– Use pipelining to isolate the sub-parts;
– For each sub-part: minimize the number of delays using retiming; if the sub-part is linear, apply optimal unfolding, else apply unfolding after isolating the nonlinear operations;
– Merge linear sub-parts to optimize further;
– Schedule the merged sub-parts to minimize memory usage.

VADA Lab.SungKyunKwan Univ. 63 SCC decomposition step
Use the standard depth-first-search-based algorithm [Tarjan, 1972], which has low-order polynomial-time complexity. For any pair of operations A and B within an SCC, there exist both a path from A to B and a path from B to A. The graph formed by all the SCCs is acyclic; thus, the SCCs can be isolated from each other using pipeline delays, which enables us to optimize each SCC separately.
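Tarjan's algorithm is only a few dozen lines. A recursive sketch (an iterative version is preferable for large graphs, to avoid Python's recursion limit):

def tarjan_scc(succs):
    # succs: node -> list of successor nodes
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]
    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in succs[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:            # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            sccs.append(comp)
    for v in succs:
        if v not in index:
            strongconnect(v)
    return sccs   # emitted in reverse topological order of the SCC DAG

# A two-node feedback loop plus a feed-forward tail:
print(tarjan_scc({'a': ['b'], 'b': ['a', 'c'], 'c': []}))  # [['c'], ['b', 'a']]

Because the SCCs come out in reverse topological order, placing pipeline delays on the edges between them (as the slide describes) is a single pass over the result.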

VADA Lab.SungKyunKwan Univ. 64 Choosing Optimal Clock Period

VADA Lab.SungKyunKwan Univ. 65 Supply Voltage Scaling
Lowering Vdd reduces energy but increases delays.
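A first-order view of this tradeoff, using the standard CMOS approximations rather than anything stated on the slide: dynamic energy per operation scales as E ∝ C_L · V_dd², while gate delay scales roughly as t_d ∝ V_dd / (V_dd − V_t)². Halving V_dd thus cuts switching energy to about a quarter but slows the logic; architecture-level parallelism or pipelining can recover the lost throughput, which is why voltage scaling is paired with the scheduling transformations above.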

VADA Lab.SungKyunKwan Univ. 66 Multiple Supply Voltages: Filter Example

VADA Lab.SungKyunKwan Univ. 67 Scheduling with shut-down: computing |a − b|

VADA Lab.SungKyunKwan Univ. 68 Loop Scheduling
Sequential execution; partial loop unrolling; loop folding.

VADA Lab.SungKyunKwan Univ. 69 Loop folding
Reduces the execution delay of a loop: pipeline operations inside the loop and overlap the execution of operations across iterations. This requires a prologue and an epilogue. Use pipeline scheduling on the loop graph model.
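A software analogue of loop folding, with an illustrative two-stage body (the stage names are made up): stage B of iteration i executes alongside stage A of iteration i+1, with an explicit prologue and epilogue:

def folded_loop(xs, stage_a, stage_b):
    results = []
    pending = stage_a(xs[0])              # prologue: fill the pipeline
    for x in xs[1:]:
        nxt = stage_a(x)                  # iteration i+1, stage A
        results.append(stage_b(pending))  # iteration i, stage B (same "cycle")
        pending = nxt
    results.append(stage_b(pending))      # epilogue: drain the pipeline
    return results

print(folded_loop([1, 2, 3], lambda x: 2 * x, lambda y: y + 1))   # [3, 5, 7]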

VADA Lab.SungKyunKwan Univ. 70 DFG Restructuring
(Figures: DFG2; DFG2 after redundant-operation insertion.)

VADA Lab.SungKyunKwan Univ. 71 Minimizing the bit transitions for constants during Scheduling

VADA Lab.SungKyunKwan Univ. 72 Control Synthesis
Synthesize a circuit that executes the scheduled operations and provides synchronization, supporting iteration, branching, hierarchy, and interfaces.

VADA Lab.SungKyunKwan Univ. 73 Allocation
Bind a resource to more than one operation: (type, id).

VADA Lab.SungKyunKwan Univ. 74 Optimum Binding: compatibility graph

VADA Lab.SungKyunKwan Univ. 75 Coloring on Conflict Graph
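Binding by conflict-graph coloring can be sketched in a few lines. Greedy coloring (highest degree first) is a heuristic, not the optimum that the previous slide's clique partitioning seeks, but it shows the mechanics; the lifetimes behind the example edges are assumed:

def greedy_color(conflicts):
    # conflicts: v -> set of operations whose lifetimes overlap v's
    color = {}
    for v in sorted(conflicts, key=lambda v: -len(conflicts[v])):
        taken = {color[u] for u in conflicts[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c                     # each color = one shared resource
    return color

conflicts = {'e': {'g'}, 'g': {'e', 'f'}, 'f': {'g'}}
print(greedy_color(conflicts))           # {'g': 0, 'e': 1, 'f': 1}: 2 registers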

VADA Lab.SungKyunKwan Univ. 76 RESOURCE SHARING
– Parallel vs. time-shared buses (or execution units).
– Resource sharing can destroy signal correlations and increase switching activity; it should be done between operations that are strongly connected.
– Map operations with correlated input signals to the same units.
– Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify the interconnect (busses, multiplexers, buffers).