High Level Synthesis
CSE 237D: Spring 2008, Topic #6
Professor Ryan Kastner
Ant System Optimization: Overview
- Ants work cooperatively on the graph; each constructs a feasible solution
- Ants leave pheromones along their traces
- Ants make decisions based partially on the amount of pheromone
- Global optimizations:
  - Evaporation: pheromones dissipate over time
  - Reinforcement: pheromones are updated from good solutions
- Quickly converges to good solutions
An ant is biased toward a path carrying more pheromone, so it is more likely to pick the "better" path. This further reinforces the better path and leads future ants to converge quickly toward the optimal solution.
Solving Design Problems using AS
- Problem model: define the solution space by creating decision variables
- Pheromone model: global heuristic that provides a history of search-space traversal
- Ant search strategy: local heuristic, a deterministic strategy for individual ant decision making
- Solution construction: probabilistically derive a solution from the local and global heuristics
- Feedback: evaluate solution quality, reinforce good solutions (pheromones), slightly evaporate all decisions (weakens poor solutions)
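To make the five components concrete, here is a minimal, self-contained sketch of the generic Ant System loop. All names (`positions`, `candidates`, `eta`, `evaluate`) and the parameter defaults are illustrative assumptions, not taken from the slides.

```python
import random
from collections import defaultdict

def ant_system(positions, candidates, eta, evaluate,
               n_iters=100, n_ants=5, alpha=1.0, beta=1.0, rho=0.98, Q=1.0):
    # Pheromone model: tau[i, j] = favorableness of making choice i at position j.
    tau = defaultdict(lambda: 1.0)
    best, best_cost = None, float("inf")
    for _ in range(n_iters):
        trials = []
        for _ in range(n_ants):
            sol = []
            for j in positions:                     # solution construction
                cands = candidates(j, sol)          # feasible choices given the partial solution
                w = [tau[i, j] ** alpha * eta(i, j) ** beta for i in cands]
                sol.append(random.choices(cands, weights=w)[0])
            trials.append((sol, evaluate(sol)))
        for key in list(tau):                       # evaporation: weaken all trails
            tau[key] *= rho
        for sol, cost in trials:                    # reinforcement: reward good trails
            for j, i in zip(positions, sol):
                tau[i, j] += Q / cost
            if cost < best_cost:
                best, best_cost = sol, cost
    return best, best_cost
```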
Autocatalytic Effect
Max-Min Ant System (MMAS) Scheduling
Problem: some pheromone trails can overpower others, leading to local minima (premature convergence).
Solution: bound the strength of the pheromones within [τmin, τmax].
- If τmin > 0, there is always a chance to make any decision
- If all trails carry the same value, the decision is based solely on local heuristics, i.e. no past information is taken into account
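A minimal sketch of the bounding step, assuming the pheromones are stored in a matrix; the values of `tau_min` and `tau_max` come from the formulas on the later MMAS slide.

```python
def clamp_pheromones(tau, tau_min, tau_max):
    """MMAS bounding step: keep every trail inside [tau_min, tau_max].
    tau_min > 0 guarantees every decision keeps a nonzero probability."""
    for i, row in enumerate(tau):
        for j, value in enumerate(row):
            tau[i][j] = min(max(value, tau_min), tau_max)
```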
MMAS RCS Formulation
Idea: combine ACO and list scheduling.
- Ants determine the priority list
- The list-scheduling framework evaluates the "goodness" of the list
- Global heuristic: pheromones indexed by (instruction, permutation position)
- Local heuristic: can use different properties
  - Instruction mobility (IM)
  - Instruction depth (ID)
  - Latency-weighted instruction depth (LWID)
  - Successor number (SN)
RCS: List Scheduling
A simple scheduling algorithm based on greedy strategies.
List scheduling algorithm:
- Construct a priority list based on some metric (operation mobility, number of successors, etc.)
- While not all operations are scheduled:
  - For each available resource, select an operation from the ready list in descending priority
  - Assign these operations to the current clock cycle
  - Update the ready list
  - Increment the clock cycle
Result quality depends on the benchmark and the particular metric.
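A runnable sketch of the algorithm above, assuming unit-latency operations and a dict-based DFG representation; all names are illustrative.

```python
def list_schedule(ops, preds, res_type, priority, n_resources):
    """Resource-constrained list scheduling (unit-latency operations).
    ops: list of op ids; preds[i]: predecessors of op i;
    res_type[i]: resource class of op i; priority[i]: higher = schedule first;
    n_resources[r]: available units of resource class r."""
    start = {}
    cycle = 0
    unscheduled = set(ops)
    while unscheduled:
        # Ready ops: unscheduled ops whose predecessors have all finished.
        ready = [i for i in unscheduled
                 if all(p in start and start[p] < cycle for p in preds[i])]
        used = {r: 0 for r in n_resources}
        # Fill each resource with ready ops in descending priority.
        for i in sorted(ready, key=lambda i: -priority[i]):
            r = res_type[i]
            if used[r] < n_resources[r]:
                start[i] = cycle
                used[r] += 1
                unscheduled.discard(i)
        cycle += 1
    return start, cycle  # cycle == schedule latency
```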
MMAS RCS: Global and Local Heuristics
- Global heuristic: the pheromone τij gives the favorableness of selecting instruction i for position j, stored in a global pheromone matrix
- Local heuristic: local metrics such as instruction mobility or number of successors
- Local decision making: a probabilistic decision combining both heuristics
- Evaporate pheromone and reinforce good solutions
τ should be very large at the very beginning, so the ants go everywhere and explore the search space.
Pheromone Model for Instruction Scheduling
- Each instruction opi ∈ I is associated with n pheromone trails τij, j = 1, …, n, each indicating the favorableness of assigning instruction i to position j
- Each instruction also has a dynamic local heuristic
- Initially all trails are set to a fixed value τ0
[Figure: instructions op1–op6 mapped to priority-list positions 1–6]
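As a sketch, the pheromone model is just an n-by-n matrix; `tau0` stands in for the fixed initial value τ0.

```python
def init_pheromones(n_ops, tau0=1.0):
    """tau[i][j]: favorableness of placing instruction i at list position j.
    All trails start at the same fixed value tau0."""
    return [[tau0] * n_ops for _ in range(n_ops)]
```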
Ant Search Strategy
- Each run has multiple iterations
- In each iteration, multiple ants independently create their own priority lists
- Each list is filled one instruction at a time
[Figure: an ant filling priority-list positions 1–6 with instructions op1–op6]
Ant Search Strategy
- Each ant has memory of the instructions already selected
- At step j, the ant has already selected j-1 instructions
- The j-th instruction is selected probabilistically
[Figure: partially filled priority list with the remaining candidate instructions]
Ant Search Strategy
- τij: global heuristic (pheromone) for selecting instruction i at position j
- ηi: local heuristic; can use different properties: instruction mobility (IM), instruction depth (ID), latency-weighted instruction depth (LWID), successor number (SN)
- α, β control the influence of the global and local heuristics
The probability that an ant places instruction i at position j is

  p_{ij} = \frac{\tau_{ij}^{\alpha}\,\eta_i^{\beta}}{\sum_{k \in \text{candidates}} \tau_{kj}^{\alpha}\,\eta_k^{\beta}}

The local heuristic is simply the inverse of the local cost; in other words, it favors an assignment that yields a smaller local cost. The numerator is the product of the global heuristic τ and the local heuristic η.
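A sketch of the probabilistic selection rule above, assuming `tau` is the pheromone matrix and `eta[i]` a precomputed local heuristic value (e.g. the inverse of instruction mobility).

```python
import random

def select_instruction(ready, j, tau, eta, alpha=1.0, beta=1.0):
    """Pick the instruction for priority-list position j among the ready,
    not-yet-selected candidates: p(i) ~ tau[i][j]^alpha * eta[i]^beta."""
    weights = [tau[i][j] ** alpha * eta[i] ** beta for i in ready]
    return random.choices(ready, weights=weights)[0]
```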
Pheromone Update
- The constructed lists are evaluated with list scheduling, giving a latency Lh for the result of ant h
- Evaporation: weakens all trails over time and punishes "useless" trails
- Reinforcement: rewards trails that produced better-quality solutions
Pheromone Update
- Evaporation happens on all trails to avoid stagnation
- Used trails are rewarded based on the solution's quality
[Figure: pheromone trails between instructions op1–op6 and priority-list positions 1–6]
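A sketch of the update, assuming the matrix representation from the pheromone-model slide; `rho` (evaporation) and `Q` (deposit scale) are illustrative parameters.

```python
def update_pheromones(tau, ant_lists, latencies, rho=0.98, Q=1.0):
    """Evaporate every trail, then reward the trails the ants actually used,
    scaled by schedule quality (shorter latency means a larger deposit)."""
    n = len(tau)
    for i in range(n):
        for j in range(n):
            tau[i][j] *= rho                    # evaporation on all trails
    for order, L in zip(ant_lists, latencies):  # order[j] = instruction at position j
        for j, i in enumerate(order):
            tau[i][j] += Q / L                  # reinforcement of used trails
```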
Max-Min Ant System (MMAS)
Risks of Ant System optimization:
- Positive feedback: the dynamic range of pheromone trails can grow rapidly
- Unused trails can be repeatedly punished, reducing their likelihood even further
- Premature convergence
MMAS is designed to address this problem:
- Built upon the original AS
- Idea: limit the pheromone trails within an evolving bound so that broader exploration is possible
- Better balances exploration and exploitation; prevents premature convergence
Pbest is the controlling parameter. When Pbest is very small, τmax ≈ τmin, so all options have approximately the same chance, placing more emphasis on exploration. When Pbest is close to 1, τmin approaches 0, giving a larger dynamic range across choices and more emphasis on exploitation. We implemented our system with both the original AS and MMAS; MMAS consistently outperformed the original.
Max-Min Ant System (MMAS)
Limit τ(t) within [τmin(t), τmax(t)]:

  \tau_{\max}(t) = \frac{1}{1-\rho} \cdot \frac{1}{f(S^{gb})}

  \tau_{\min}(t) = \frac{\tau_{\max}(t)\,\bigl(1 - \sqrt[n]{P_{best}}\bigr)}{(avg - 1)\,\sqrt[n]{P_{best}}}

- Sgb is the best global solution found up to iteration t-1
- f(·) is the quality evaluation function, i.e. latency in our case
- ρ is the evaporation factor in [0, 1]
- avg is the average size of the decision choices; n is the number of decisions
- Pbest ∈ (0, 1] is the controlling parameter: the conditional probability of Sgb being selected when all trails in Sgb have τmax and all others have τmin
- A smaller Pbest gives a tighter range, i.e. more emphasis on exploration
- When Pbest → 0, we set τmin = τmax
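A sketch of the evolving bounds, assuming the standard MMAS formulas of Stützle and Hoos; `rho` here is the retention factor used in the update sketch earlier (τ ← ρτ).

```python
def mmas_bounds(best_cost, n_decisions, avg_choices, rho=0.98, p_best=0.05):
    """Evolving MMAS trail limits (standard Stuetzle/Hoos form, assumed here).
    best_cost = f(S_gb), latency of the best-so-far solution;
    rho = retention factor of the evaporation step (tau <- rho * tau);
    avg_choices = avg, the average number of alternatives per decision."""
    tau_max = (1.0 / best_cost) / (1.0 - rho)      # fixed point of repeated deposit
    root = p_best ** (1.0 / n_decisions)           # n-th root of P_best
    denom = max((avg_choices - 1.0) * root, 1e-9)  # guard against avg == 1
    tau_min = tau_max * (1.0 - root) / denom
    return min(tau_min, tau_max), tau_max          # P_best -> 0 collapses min to max
```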
Other Algorithmic Refinements
- Dynamically evolving local heuristics
  - Example: dynamically adjust instruction mobility
  - Benefit: progressively reduces the search space
- Take advantage of topological sorting of the DFG when constructing the priority list
  - At each step, ants select from the ready instructions instead of from all unscheduled instructions
  - Benefit: greatly reduces the search space
For an 11-node example, constructing the list under topological sorting shrinks the search space from 11! orderings to a far smaller number of topologically valid ones. We used GenLE to generate topologically ordered combinations.
MMAS RCS Algorithm
RCS Results: Pheromones (ARF)
The evolutionary effect on the global heuristics τij is illustrated in Figure 3, which plots the pheromone values for the ARF test sample after 100 iterations of the proposed algorithm. The x-axis is the index of the instruction node in the DFG (shown in Figure 2), and the y-axis is the order index in the priority list passed to the list scheduler. There are 30 nodes in total, with node 1 and node 30 as the dummy source and sink of the DFG. Each dot indicates the strength of the resulting pheromone trails for assigning the corresponding order to a given instruction: the bigger the dot, the stronger the pheromone. Figure 3 clearly shows a few strong pheromone trails while the remaining trails are very weak. This might be explained by the strongly symmetric structure of the ARF DFG and by our algorithm's restriction to topologically sorted instruction lists. It is also interesting that although many instructions have a few alternative "good" positions (such as instructions 6 and 26), for some instructions the pheromone heuristics are strong enough to lock their positions. For example, according to its pheromone distribution, instruction 10 should be placed as the 28th item in the list, and there is no other competitive position for it. After careful evaluation, this ordering preference cannot be trivially obtained by constructing priority lists with any of the popular heuristics mentioned above. This shows that the proposed algorithm can discover better orderings that may be hard to achieve intuitively.
Benchmarks: ExpressDFG
- A comprehensive benchmark suite for TCS/RCS
- Classic samples and more modern cases
- Comprehensive coverage of problem sizes, complexities, and applications
- Downloadable online
Auto Regressive Filter
Cosine Transform
Matrix Inversion
RCS Experimental Results
[Table: schedule latencies per benchmark, HAL (21/25), ARF (28/30), EWF (34/47), FIR1 (40/39), FIR2 (44/43), COSINE1 (66/76), COSINE2 (82/91), comparing CPLEX (optimal ILP, latency/runtime), force-directed scheduling, list scheduling, and MMAS-IS (averaged over 5 runs) under the IM, ID, LWID, and SN local heuristics]
- Heterogeneous RCS: multiple types of resources (e.g. fast and slow multipliers)
- ILP (optimal) solved with CPLEX
- List scheduling with instruction mobility (IM), instruction depth (ID), latency-weighted instruction depth (LWID), and successor number (SN)
- Ant scheduling results with the same local heuristics (averaged over 5 runs, each run 100 iterations with 5 ants)
These are real-life DFG examples commonly used in instruction scheduling studies; we also examined MediaBench. For each benchmark, we ran the proposed algorithm with each choice of local heuristic: 5 runs per choice, 100 iterations per run, 5 ants per iteration. The evaporation rate is held fixed, the scaling parameters for the global and local heuristics are set to α = β = 1, and the delivery rate is Q = 1. The best schedule latency is recorded at the end of each run, and the average over the runs is reported for each setting.
The proposed algorithm generates better results consistently across all test cases, with significant latency improvements on some samples. The biggest saving is 23%, obtained on FIR2 using LWID. Compared with force-directed scheduling, we see a 6.2% average improvement, with a maximum of 14.7%. Across the benchmarks with known optima, our algorithm improves average schedule latency by 44% compared with the list scheduling heuristics.
The proposed algorithm is also more stable: it is much less sensitive to the choice of local heuristic and input application. The standard deviation of its results over all benchmarks and heuristic choices is much smaller than that of the traditional list scheduler. In other words, we can expect much more stable scheduling results on different application DFGs regardless of the local heuristic, a desirable attribute in practice.
RCS Experimental Results
- Homogeneous RCS: all resources have unit delay
- The new benchmarks (compared to the last slide) are too large for ILP
MMAS RCS: Results
- Consistently generates better results across all test cases
- Up to 23.8% better than the list scheduler
- Average 6.4%, and up to 15%, better than force-directed scheduling
- Quantitatively closer to the known optimal solutions
MMAS TCS Formulation
Idea: combine ACO and force-directed scheduling (FDS).
Quick FDS review:
- Uniformly distribute the operations onto the available resources
- Operation probability and distribution graph (DG)
- Self force: the change in the DG from scheduling an operation
- Predecessor/successor forces: implicit effects on the DG
- Schedule each operation to the step with the minimum force
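A small sketch of the FDS quantities reviewed above (operation probability, distribution graph, and self force), under the usual uniform-probability assumption; names are illustrative.

```python
def op_probability(asap, alap):
    """Uniform probability of the op landing in each step of its time frame."""
    frame = range(asap, alap + 1)
    p = 1.0 / len(frame)
    return {j: p for j in frame}

def distribution_graph(frames, n_steps):
    """DG[j]: expected number of same-type operations active in step j."""
    dg = [0.0] * n_steps
    for asap, alap in frames:
        for j, p in op_probability(asap, alap).items():
            dg[j] += p
    return dg

def self_force(dg, asap, alap, t):
    """Force of pinning an op to step t: the DG-weighted change in the op's
    step probabilities. FDS schedules the op at the step of minimum force."""
    probs = op_probability(asap, alap)
    return sum(dg[j] * ((1.0 if j == t else 0.0) - p) for j, p in probs.items())
```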
ACO Formulation for TCS
- Initialize the pheromone model
- While termination is not satisfied:
  - Create ants
  - Each ant finds a solution
  - Evaluate the solutions and update the pheromones
- Report the best result found
Trails τij indicate the favorableness of assigning instruction i to timestep j.
[Figure: example DFG with nodes v1–v11 (source S, sink E) scheduled into control steps 1–4]
ACO Formulation for TCS
- Initialize the pheromone model
- While termination is not satisfied:
  - Create ants
  - Each ant finds a solution
  - Evaluate the solutions and update the pheromones
- Report the best result found
Each ant selects an operation oph probabilistically, then selects its timestep as follows:
- Global heuristic: pheromone trails tied to the search experience
- Local heuristic: the inverse of the distribution graph, 1/qk(j)
- α and β are constants weighting the two heuristics
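A sketch of the timestep selection, assuming `dg[j]` is the distribution-graph value at step j and `frame` the operation's ASAP/ALAP range.

```python
import random

def select_timestep(op, frame, tau, dg, alpha=1.0, beta=1.0):
    """Pick a timestep j for operation op within its ASAP/ALAP frame.
    Global heuristic: pheromone tau[op][j]; local heuristic: 1/DG(j),
    so lightly loaded control steps look more attractive."""
    weights = [tau[op][j] ** alpha * (1.0 / max(dg[j], 1e-9)) ** beta
               for j in frame]
    return random.choices(list(frame), weights=weights)[0]
```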
ACO Formulation for TCS
- Initialize the pheromone model
- While termination is not satisfied:
  - Create ants
  - Each ant finds a solution
  - Evaluate the solutions and update the pheromones
- Report the best result found
Good partial solutions are rewarded based on solution quality, and all pheromones evaporate.
Final Version of MMAS-TCS
Effectiveness of MMAS-TCS
MMAS TCS: Results
- MMAS TCS is more stable than FDS, especially when the solution is highly unconstrained
- 258 out of 263 test cases are equal to or better than the FDS results
- Uses 16.4% fewer resources
Design Space Exploration
DSE challenges for the designer:
- Ever-increasing design options
- Closely tied to NP-hard problems: resource allocation and scheduling
- Conflicting objectives (speed, cost, power, …)
- Increasing time-to-market pressure
Our Focus: Timing/Cost Tradeoffs
- Known application
- Known resource types
- Known operation/resource mapping
- Question: find the optimal timing/cost tradeoff curve
- The most commonly faced problem, and fundamental to other design considerations
Common Strategies
Usually done in an ad-hoc way:
- Experience-dependent, or
- Scanning the design space with resource-constrained (RCS) or time-constrained (TCS) scheduling
What's the problem? RCS and TCS are dual problems. Can we effectively use information from one to guide the other?
Design Space Model
Key Observations
- A feasible configuration C covers a beam starting from (tmin, C), where tmin is the RCS result for C
Design Space Model
Key Observations
- A feasible configuration C covers a beam starting from (tmin, C)
- The optimal tradeoff curve L is monotonically non-increasing as the deadline increases
Design Space Model
Theorem
If C is the optimal TCS result at time t1, then the RCS result t2 of C satisfies t2 ≤ t1. More importantly, no configuration C′ with a smaller cost can produce an execution time within [t2, t1].
Theorem (continued)
What does it give us?
It implies that we can construct L iteratively:
- Start from the rightmost deadline t
- Find the TCS solution C at t
- Push it leftwards using the RCS solution of C
- Repeat, alternating between TCS and RCS
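A sketch of this iteration, assuming `tcs(t)` returns the cheapest configuration meeting deadline t and `rcs(C)` the minimum latency of configuration C (e.g. the MMAS-based schedulers); names are illustrative.

```python
def explore_tradeoff(t_max, t_min, tcs, rcs):
    """Build the optimal timing/cost curve L from right to left.
    By the theorem, the configuration found at deadline t covers the
    whole beam [rcs(C), t] at no extra cost, so those deadlines are done."""
    curve = []
    t = t_max
    while t >= t_min:
        config = tcs(t)              # best configuration at this deadline
        t2 = rcs(config)             # push the same configuration leftwards
        curve.append((t2, config))   # config covers the beam [t2, t]
        t = t2 - 1                   # next unexplored deadline
    return curve
```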
DSE Using Time/Resource Duality
Experiments
Three DSE approaches:
- FDS: exhaustively scanning with TCS
- MMAS-TCS: exhaustively scanning with TCS
- MMAS-D: the proposed method leveraging duality
(Scanning means performing TCS at each deadline of interest.)
DSE: MMAS-D vs. FDS
Experimental Results
Algorithm Runtime
Real Design Complications
- Heterogeneous mapping: one operation has many implementations
  - Different bit-widths, e.g. a 32-bit multiplier serves both mul(24) and mul(32)
  - Different area and delay
- Real technology libraries are extremely sophisticated
  - Hard to estimate final timing and total area
  - Sharing depends on the cost of multiplexers
- Downstream tools may not generate what we expect
  - Resource sharing, register sharing
  - Logic synthesis, placement, and routing break component boundaries
Resource Allocation and Scheduling
Scheduling cost function?
- Homogeneous TCS: total number of components
- Heterogeneous TCS: total area of functional units
  - FPGA designs: LUTs, slices, BRAMs, …
  - ASIC designs: silicon area
- Total area comes from: functional units, registers, multiplexers, interconnect
Towards the Real World: Constraint Graph
A hierarchical directed graph:
- Nodes V: operations
- Edges E(vi, vj, Tij): timing constraints
- Timing constraints Tij(c, o) express start-time dependencies, finish-time dependencies, and chained dependencies
Constraint Graph: Examples
- Operation b must start after operation a
- Operation a starts at least two cycles after the start of operation b
- Operations a and b are scheduled in the same cycle
- Operation b is scheduled exactly one cycle after the start of operation a
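One plausible encoding of these examples, assuming an edge (u, v, w) imposes start(v) ≥ start(u) + w; the exact representation in the tool is not specified on the slide.

```python
# Hypothetical encoding: an edge (u, v, w) imposes start(v) >= start(u) + w.
# Each example above becomes one or two such edges:
examples = {
    "b must start after a":        [("a", "b", 1)],
    "a at least 2 cycles after b": [("b", "a", 2)],
    "a and b in the same cycle":   [("a", "b", 0), ("b", "a", 0)],   # equality = two edges
    "b exactly 1 cycle after a":   [("a", "b", 1), ("b", "a", -1)],  # pins the offset to 1
}
```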
Pipelined Designs
- Start a new task before the prior one completes
- Feedback constraints among nodes enforce a specific initiation interval
- Improves throughput, but requires more hardware
Operation Chaining
- Two or more operations scheduled in the same clock cycle
- Faster/larger components, shorter latency, fewer registers
- Chaining across clock edges
Speculative Execution
Problem Formulation
- Constraint graph: nodes V are operations; edges E are data dependencies and timing constraints
- Technology library Q: area and timing
- Resource constraints
- Desired clock period C
- Determine the start time and resource allocation of each operation
- Covers both resource-constrained and timing-constrained scheduling
MMAS CRAAS: Overview
Start with an initial result:
- Use the fastest components, ASAP/ALAP
- Resolve resource conflicts
- Meet the timing and resource constraints
MMAS then iteratively searches for optimal solutions.
MMAS CRAAS: ASAP/ALAP
- Iterative ASAP/ALAP
- Handles loops/feedback in the constraint graph
- Checks for ill-posed timing constraints
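A sketch of iterative ASAP over such a constraint graph: Bellman-Ford-style relaxation handles feedback edges, and failure to converge flags an ill-posed constraint. The edge encoding follows the constraint-graph example earlier; the details are assumptions.

```python
def asap_schedule(nodes, edges):
    """Iterative ASAP over a constraint graph with feedback edges.
    edges: (u, v, w) meaning start(v) >= start(u) + w.
    If start times still change after |V| relaxation passes, a
    positive-weight cycle exists and the constraints are ill-posed."""
    start = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in edges:
            if start[u] + w > start[v]:
                start[v] = start[u] + w
                changed = True
        if not changed:
            return start
    raise ValueError("ill-posed timing constraints (positive-weight cycle)")
```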
MMAS CRAAS: Initial Schedule
- Resource conflicts: the ASAP result uses more resources than are available
- Resolve them by pushing operations forward
MMAS CRAAS: Overview
Each ant constructs a schedule:
- Load the ASAP timing results
- Update the mobility ranges and operation probabilities
- Update the distribution graph
- Probabilistically defer operations
- Probabilistically select operations
- Schedule operations using p(i, j, k)
- Update the ASAP/ALAP results
MMAS CRAAS: Global and Local Heuristics
- Local heuristics: favor smaller functional units and fewer registers for the operation; uniform probability among all compatible resources
- Global heuristics: favor solutions with smaller area
MMAS CRAAS: Scheduling
- Defer operations from the current iteration, favoring operations with many options
- Schedule an operation
- Update the ASAP schedule
- Update the global heuristics
MMAS CRAAS: Results
- Implemented in a leading high-level synthesis framework
  - Leverages the HDL back-ends to collect results
  - The front end parses C and performs optimizations
  - Resource sharing and register sharing after scheduling
- The existing algorithm: based on FDS/FDLS, refined for real designs
  - Force-directed operation deferring
  - Re-allocates resources and iterates until the area would increase
- Results overview: 3-15% smaller (optimizing area), 1-4% faster (optimizing latency)
MMAS CRAAS: Results
MMAS CRAAS: Results
- Hard to generate good results for control-dominated designs (158, 160, and 54)
- Better resource allocation and sharing than the existing algorithm
- The existing algorithm converges prematurely
- Consistent with previous observations
Conclusions and Future Research
- There is (was?) room for more work on fundamental algorithms; they make a difference on real designs
- Ivory tower: most academics do not tackle real-world problems
  - Constraint graphs with pipelining, speculation, and chaining
  - Actual delay and area (muxes, interconnect, …)
- Gripes: it is extremely hard to validate new algorithms against old ones (e.g. no open-source code for FDS!)
  - Back-end hooks into commercial tools (a la Quartus)
  - Benchmarks?!