Integrated Hardware-Software Co-Synthesis and High-Level Synthesis for Design of Embedded Systems under Power and Latency Constraints Alex Doboli VLSI.

Integrated Hardware-Software Co-Synthesis and High-Level Synthesis for Design of Embedded Systems under Power and Latency Constraints Alex Doboli VLSI Systems Design Laboratory Department of Electrical and Computer Engineering State University of New York (SUNY) Stony Brook, USA

Motivation of the Paper
Challenges for Hw/sw co-synthesis of low power systems: System-level design decisions are critical for establishing the power consumption of an implementation It is hard to decide effective latency-power consumption trade-offs without knowing in detail the used hardware resources This paper presents an integrated approach to hardware-software co-synthesis and high level synthesis (HLS) for the design of low-power embedded systems. The paper is based on the following motivation: - System level design tasks (such as hardware-software partitioning, task scheduling, communication synthesis, etc) are critical for building high-quality low-power systems. It is well known that by the time the RTL design of a system is finished, 80% of its power consumption is already committed. Also, traditional hardware/software co-synthesis methodologies rely on hardware abstraction at a high-level. For example, hardware is represented through its capabilities to execute in parallel tasks/operations in the specification. However, an abstract perspective on hardware is not practical for low-power co-synthesis as it offers vague power consumption estimations. Power consumption estimations depend very much on the hardware specifics, which are not available if hardware is modeled at an abstract level. - There for, we believe that hardware-software co-synthesis for low-power systems should be performed in a tight integration with HLS under the guidance of an accurate timing and power evaluation (estimation) mechanism. This is the topic of this paper. Our Proposed Solution Hw/sw co-synthesis and HLS have to be integrated so that efficient power saving decisions can be contemplated at the system level

Presentation Outline System Representation for Integrated Co-Synthesis
Performance Modeling for Co-Synthesis Modeling of Design Decisions for Co-Synthesis Integrated Co-Synthesis Methodology Experimental Results Conclusions and future work The presentation will cover the following aspects: System representation used for performing integrated hardware/software co-synthesis and HLS Modeling of latency and power consumption for hardware/software co-synthesis Modeling of design decisions i.e. scheduling and binding for co-synthesis Discussion of the integrated co-synthesis methodology Presentation of our experimental results to support the proposed hardware/software co-synthesis methodology We end the presentation by enumerating our conclusions and also indicating possible ways of extending the discussed work

Related Work Co-synthesis traditionally contemplates following assumptions: Hardware is abstracted (Ernst 93, Gupta 92, Yen 97) by its capacity to (1) concurrently execute operations and (2) be shared by similar operations => Power consumption minimization needs detailed perspective on Hw System functionality is described as a task graph with data dependencies, only (Dave 97, Henkel 99, Yen 97) => Control dependencies enable better quality implementations Recently, Co-synthesis for low-power embedded systems emerged: DAC 99 suggests hw-sw partitioning for low-power systems Dave et DAC 97 propose low-power co-synthesis including resource allocation, scheduling and performance estimation Traditional hw-sw co-synthesis methodologies contemplate two main assumptions: Hardware is abstracted by considering two generic properties (Ernst 93, Gupta 92, Yen 97: its capacity to (1) concurrently execute operations and to (2) be shared by similar operations. This is a valid assumption if objective is speeding-up process execution. The link between co-synthesis and HLS, however, has to be much stronger for power consumption minimization. System functionality is described as a task graph with data dependencies, only (Dave 97, Ernst 93, Gupta 92, Henkel 99, Yen 97). This implies that co-synthesis does not address any conditional behavior expressed with control dependencies. Nevertheless, handling control dependencies enables more precise performance estimations, more effective design decisions i.e. scheduling and ultimately, better quality solutions. Recent work discusses hardware-software co-synthesis for low-power embedded systems: Henkel suggests a hardware-software partitioning method for low-power systems. Partitioning is at the level of operation (instruction) clusters. After cluster scheduling, clusters with a high utilization rate (thus, with less wasted energy) are moved to hardware. Still, there is no guarantee that a hardware-software partition actually meets imposed power constraints because accurate estimations are possible only at the RTL level, hence, after HLS. A tighter integration of co-synthesis and HLS is probably needed to surmount this limitation. Dave et al propose a low-power co-synthesis method that includes allocation, scheduling and performance estimation. They assume that task graphs do not include any control dependencies. Also, authors do not consider any low-power specific design issues such as gating or temporary shut-down of unused resources.

What do we propose? Paper presents an integrated approach to hardware-software co-synthesis and HLS for design of low-power embedded systems Goal is to find the hardware-software implementation of a system, that minimizes overall power consumption while satisfying a global latency constraint Assumptions: All available hardware resources are known (I.e. general-purpose processing elements, functional units (adders, multipliers, etc)) This paper presents an integrated approach to hardware-software co-synthesis and HLS for design of low-power embedded systems. Goal is to find the hardware-software implementation of a system, that minimizes overall power consumption while satisfying a global latency constraint. All available hardware resources including general-purpose processing elements - PEs for executing software and functional units - FUs (adders, multipliers, etc) for hardware are known. .

What do we propose? Integrated method for Co-synthesis & HLS was realized using simulated annealing: Operation clusters are partitioned and scheduled to PEs (as in traditional co-synthesis) and operations are bound and scheduled to FUs (like in HLS) Low-power oriented aspects such as PE shut-down are contemplated for each solution Exploration is guided by Performance Models (PM). PMs capture the relationship between latency and power consumption and design decisions i.e. binding and scheduling Our Contribution: Integrated approach to co-synthesis and HLS permits: 1) More accurate latency and power estimations 2) Exposes RTL-level decisions for power reduction at the system level => More effective performance trade-offs during co-synthesis Integrated method co-synthesis and HLS was realized as a simulated annealing (SA) based solution-space exploration: During exploration, operation clusters are partitioned and scheduled to PEs (as in traditional co-synthesis) and operations are bound and scheduled to FUs (like in HLS). Low-power oriented aspects such as PE shut-down are then contemplated for each partitioned (bound) and scheduled solution. Exploration is guided by Performance Model (PM) evaluation. PMs are graphs that exactly capture the relationship between latency and power consumption and design decisions i.e. binding and scheduling. We believe that PM are more suggestive and precise than traditional performance modeling through metrics such as task (operation) priorities, forces, degree of concurrency, utilization rate etc. - The integrated approach to co-synthesis and HLS is an important contribution of this paper not only because it permits a more accurate latency and power estimations but also because it exposes RTL-level design decisions at the system level. As a result, more effective performance trade-offs are possible during co-synthesis.

System Representation for Integrated Co-Synthesis and HLS
Hierarchical Data and Control Dependency Graph includes: Operation nodes (ON) used for High-level synthesis Cluster nodes (CN) used for Hardware/Software Co-synthesis Data and control dependencies exist among the nodes ON and CN are annotated with: - Execution time and power consumption Hierarchical Data and Control Dependency Graph (HDCG) is the input for our integrated co-synthesis and HLS methodology. HDCG offers a dual perspective on a system: a task-level description that is used for co-synthesis an operation-level representation that is needed for HLS. HDCG is a hybrid representation with nodes of two types: Operation nodes (ON) are needed to conduct HLS activities such as operation binding and scheduling. They denote an atomic processing such as addition, multiplication, comparison, etc. Communications are described as special ONs that are not part of any CN. Cluster nodes (CN) are useful to perform hardware-software co-synthesis, including partitioning and task scheduling. CNs are formed of ONs. Both CNs and ONs are linked through data and control dependencies. Each ON and CN is annotated with its execution times and power consumptions specific to all hardware resources that can implement it. The technique suggested by Gupta et al can be used for ON characterization and the method by Tiwari et al can be applied for CN characterization. Any dependency of power consumption on input data \cite {pedram96} \cite {landman96} is addressed at this level.

System Representation for Integrated Co-Synthesis and HLS
Communication node 1 Cond1 2 - Cond1 Cluster node 6 4 3 Cond2 - Cond2 * * * * 7 Cluster node HDCG example: Note the dual perspective on the system representation: At the system level: there are CN nodes i.e. Node 2, Node 3 etc At the operation level: there are ON nodes such as the adders and multiplier nodes that form the CN node 3 Node 2 computes condition cond1, and depending on its value, only one of its out-branches is selected. For a true value, the graph traversal activates the communication between nodes 2 and 3 (green bubble) , followed by starting operations pertaining to node 3. Node 4 is executed for a false condition value ( false is indicated by - cond1). + + 8 5 Operation node 1

Performance Modeling for Co-Synthesis
Motivation: For effective co-synthesis, performances of implementations have to be accurately related to functionality and design decisions Solution: Employ Performance Models that accurately capture timing and power consumption System descriptions as HDCGs represent functionality aspects but do not indicate any timing and power characteristics of alternative implementations. However, for selecting the best solution for co-synthesis, a technique is needed for accurately relating performances to HDCG characteristics and synthesis decisions. We propose Performance Models (PM) - a representation for expressing timing and power consumption. Next we discuss Performance Models, and present their main characteristics. Rules for automatically building PMs are presented in the paper.

Modeling of Design Decisions
Performance model includes Constant part that corresponds to the Time/power characteristics of the functional nodes Data and control dependencies Variable part that reflects Design decisions I.e. scheduling, binding Values for latency and power consumption result by numerically evaluating PMs PM representation includes two parts: a constant part that presents HDCG characteristics a variable part that reflects design decisions taken during co-synthesis For example, a certain scheduling order of CNs mapped to the same PE is correspondingly described by the variable part of a PM.

Performance Modeling for Co-Synthesis
1 2 3 5 4 Start End HDCG Latency Performance Model + max T1_ex T3_ex T2_ex T5_ex T4_ex Latency The left part of the Figure shows an HDCG. Lets assume that operation nodes 2, 5 and 3 are executed in this order on a shared FU. The right part of the Figure depicts the PM of the HDCG for latency: The constant part of the PM is represented in the figure as nodes and solid edges. It depends on invariant HDCG characteristics i.e. data and control dependencies. For example, max and addition nodes are used in the PM in the right part of the Figure to express ON (CN) start and end times. Max nodes describe that an ON (CN) can not start earlier than the moment when all its predecessors were finished. Outputs of max nodes indicate the starting time of their corresponding ON (CN). Addition nodes describe that the end time Ti_e of ON opi is the sum between its start time and its execution time Ti_ex. The variable part of the PMs shows the influence of scheduling decisions for ONs 2, 3 and 5 on the resulting latency. For instance, the execution order 2, 5, 3 is represented in the figure by dotted arcs between addition nodes that calculate end times for ON (CN) and max nodes that characterize starting times. Other ON schedulings are easily captured by the PM by accordingly changing the orientation of dotted arcs. The latency of a certain implementation solution can be precisely determined by numerically evaluating its PM.

Modeling of Design Decisions for Co-Synthesis
We modeled the impact on latency and power consumption Performance Models of Data and Control dependencies ON and CN Scheduling ON Binding and CN Partitioning The paper presents the modeling of data and control dependencies, ON and CN scheduling and ON and CN binding using the Performance Models for latency and power consumption.

Integrated Co-Synthesis Methodology
HDCG + Latency Constraints Integrated co-synthesis and HLS Performance Model Generation Partitioning (binding) + Scheduling Ri Ti Ti_ex Performance Model Integrated hardware-software co-synthesis and HLS technique is presented in the Figure. Inputs to co-synthesis are an HDCG and latency constraint. The number and type of available hardware resources, including PEs and FUs, is also given. Co-synthesis problem is finding what resource executes each HDCG node and deciding the scheduling order of nodes so that power consumption is minimized while latency is below the given constraint. The method includes two steps. First, Performance Models (PM) are generated for an HDCG. Second step is the integrated co-synthesis and HLS, which was defined as an optimization algorithm. As explained in Section 3, for each CN (ON), attributes R_i (hardware resource that executes the node) and T_i (starting time of node execution) constitute unknowns for the co-synthesis process. CN partitioning to PEs and ON binding to FUs is modeled by unknowns R_i. CN and ON scheduling is described by unknowns T_i. Possible values for R_i and T_i are searched during exploration. Power consumption and latency are computed by numerically instantiating all node characteristics R_i, T_i and Ti_ex and then evaluating their PMs. Unknowns ij_stop model resource shut-down during the idle period defined by the end of node i and the start of node j.Values for these variables are established by a greedy algorithm that examines all possibilities of turning-off hardware resources Establish Resource Shut-Down Points ij_stop Latency & Power Consumption

Co-Synthesis Methodology
Partitioning and scheduling are realized as simulation annealing: Neighborhood definition: a new solution differs by the execution order of one pair of CNs/ONs or by the binding of one CN/ON Initial Solution: CNs/ONs are uniformly distributed to hardware resources and scheduled using list-scheduling Hierarchical exploration: Is emulated using different probabilities for binding (lower) and scheduling (higher) The simulated annealing (SA) algorithm for co-synthesis iteratively selects a new point from the neighborhood of the current solution: Neighborhood is the set of points that either (1) differs from the current solution by the execution order of one pair of nodes that share a hardware resource or (2) the resource binding of one node. PMs are updated for the newly selected solution and then used for numerically evaluating the resulting power consumption and latency. If the resulting solution has a better quality than the current then the new solution is unconditionally accepted. A worse solution is accepted with a higher probability at the beginning of exploration. Initial solution is obtained by uniformly distributing nodes to resources, and then scheduling nodes using list-scheduling. Critical path was the priority function for list-scheduling. Partitioning (binding) and scheduling steps are executed with different probabilities. The reason is that multiple valid schedules are possible for each resource partitioning (binding) decision. A small probability p1 is used to select a partitioning step, that moves a cluster from a PE to another PE or to hardware. A probability p2 (p2 > p1) binds an ON to another FU. The reason for p2 being greater than p1 is that multiple HLS solutions are possible for each partitioning of clusters to FUs. Finally, a probability 1 - (p1 + p2) decides a scheduling action.

Establishment of Resource Shut-Down Points
Greedy approach to identifying shut-down points of unused hardware resources based on resulting power saving and without violating timing constraints: Algorithm is called after each scheduling decision Each possible shut-down point of a resource is found by examining ON/CN schedules and identifying the idle times of a resource Shut-down points are decided based on the amount of the resulting power savings We propose a greedy algorithm for deciding what hardware resource shut-downs result in power savings and do not violate the imposed latency constraint. As discussed before, variable ij_stop models the decision to stop the hardware resource Resource(i) shared by nodes i and j between the end of node i and start of node j. These variables are affected only when a new scheduling decision is taken because then resource idle times are modified. More precisely, these variables correspond to the nodes that have their starting times changed by the scheduling decision. Hence, the algorithm for establishing resource shut-down points has to be called after each scheduling decision of the SA exploration. The algorithm considers each of the variables and checks if any power saving results by shuting-down the resource. Power savings are calculated by using PMs. Resource shut-downs that offer the highest power savings and do not violate the fixed latency constrained are retained in the final solution.

Experimental Results Experimental set-up: Considered examples
Video coding algorithm H261 4 x 4 determinant calculator Imposed latency constraints 2000 ms and 4000 ms for H261 300 ms and 1000 ms for the determinant calculator Distinct resource sets considered for each application Quality of designs was studied by comparing the results of the integrated approach with task-level co-synthesis (also using simulated annealing) Goal of our experimental part was to evaluate the effectiveness of the integrated approach for hardware-software co-synthesis and HLS as compared to traditional task-level co-synthesis. We analyzed the resulting power consumption of solutions found by proposed and task-level methods for different latency constraints and distinct hardware resources. Both loose and tight latency constraints were considered. We used two examples the video coding algorithm H261 (video) and the 4x4 determinant calculator. Both integrated and task-level co-synthesis methods were realized as simulated-annealing based exploration algorithms. Hence, we excluded the possibility that observed differences in solution quality were due to different exploration techniques.

Experimental Results Power savings for the integrated co-synthesis method range from 1.6% to 27% Power savings are on the average higher than 10% Latencies are also in general smaller for the proposed method We observed that obtained power savings decrease as latency constraint relaxes => more functionality is placed into software and power savings are less effective Experimental results show that the integrated co-synthesis and HLS approach offers significant power savings as compared to a task-level co-synthesis method. Power savings ranged from 1.6% to 27% but were, in general higher than 10%. Obtained solutions had also a smaller latency for the integrated method. The reason for the proposed integrated method performing better may be attributed to a more efficient utilization of the available hardware resources. Better hardware utilization is due to superior FU sharing that can be revealed in the integrated method. As a result, resource idle periods are shorter so that less energy is wasted. Also, more functionality can be shifted towards slower but less power consuming resources without violating the imposed latency constraints. An interesting observation was that relative power saving tends to decrease as imposed latency relaxes. This can be explained by the fact that for loose latency constraints functionality is mostly placed into software. In our experiments, power consumption was less in software than in hardware.

Conclusions We discussed integrated approach to co-synthesis and HLS for minimizing power consumption so that latency constraint is satisfied We proposed - Hierarchical Data and Control Dependency Graphs - Performance Modeling for latency and power consumption - Hierarchical exploration algorithm for co-synthesis and HLS - Greedy technique for finding resource shut-down points Experiments showed effective latency and power consumption trade-offs for the integrated co-synthesis approach We presented an integrated approach to hardware-software co-synthesis and HLS for design of low-power embedded systems. Integrated co-synthesis and HLS is accomplished by describing systems as Hierarchical Data and Control Dependency Graphs (HDCG) that include operation clusters for software and operations for hardware. Integrated approach was realized as a simulated annealing (SA) based solution-space exploration. Exploration is guided by Performance Model (PM) evaluation. PMs are graphs that express the relationship between latency and power consumption and design decisions (i.e. binding and scheduling) and HDCG characteristics (i.e. data and control dependencies). Our experiments showed that more effective performance trade-offs are possible than for task-level co-synthesis.

Future Work Extend the integrated co-synthesis methodology by including memory synthesis Consider a more detailed communication synthesis process Experiment with alternative methods for hierarchical exploration Incorporate the shut-down analysis into the combined co-synthesis & HLS methodology

Integrated Hardware-Software Co-Synthesis and High-Level Synthesis for Design of Embedded Systems under Power and Latency Constraints Alex Doboli VLSI.

Similar presentations

Presentation on theme: "Integrated Hardware-Software Co-Synthesis and High-Level Synthesis for Design of Embedded Systems under Power and Latency Constraints Alex Doboli VLSI."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Integrated Hardware-Software Co-Synthesis and High-Level Synthesis for Design of Embedded Systems under Power and Latency Constraints Alex Doboli VLSI.

Similar presentations

Presentation on theme: "Integrated Hardware-Software Co-Synthesis and High-Level Synthesis for Design of Embedded Systems under Power and Latency Constraints Alex Doboli VLSI."— Presentation transcript:

Similar presentations

About project

Feedback