ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, Pages Tianyi Wang, Gang Quan, Shangping Ren, Meikang Qiu 曾冠維
Outline: Introduction, Preliminary, Performance evaluation, Experimental results, Conclusions
Introduction
Variations in IC chip manufacturing can cause significant performance discrepancies. One major problem caused by manufacturing variations is reduced fabrication yield.
Therefore, micro-architecture-level and core-level redundancy are employed to improve the fabrication yield. According to "Exploiting micro-architectural redundancy for defect tolerance", core-level redundancy achieves the better yield.
Another problem caused by manufacturing variations is performance variation between cores.
How can the total schedule length of a task graph be reduced when realizing its nominal design? By developing a performance metric based on opportunity cost.
Preliminary
Use the Row Rippling Column Stealing (RRCS) algorithm to replace faulty cores with redundant cores.
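The application is modeled as a task graph $G = \{V, E\}$, where $V = \{v_1, v_2, \ldots, v_k\}$ and $E = \{e(i,j) = (v_i, v_j) \mid \text{task node } v_i \text{ communicates with task node } v_j\}$. $|v_i|$ represents the execution time of task node $v_i$. The logical architecture, denoted here by $A_L$, is assumed to consist of $r \times c$ cores: $A_L = \{l_{i,j} \mid i = 0, \ldots, r-1;\ j = 0, \ldots, c-1\}$.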
The nominal design of application $G$ based on the logical architecture $A_L$ is denoted $N(G, A_L)$. The physical architecture, denoted here by $A_P$, is assumed to consist of $m \times n$ cores: $A_P = \{p_{i,j} \mid i = 0, \ldots, m-1;\ j = 0, \ldots, n-1\}$.
Problem: Given an application $G$; a logical architecture $A_L$ of $r \times c$ logical cores $l_{i,j}$; the nominal design of $G$ on $A_L$, i.e. $N(G, A_L)$; and a physical architecture $A_P$ of $m \times n$ physical cores $p_{x,y}$.
Find the mapping $M = \{l_{i,j} \to p_{x,y} \mid i = 0, \ldots, r-1;\ j = 0, \ldots, c-1;\ 0 \le x \le m-1;\ 0 \le y \le n-1\}$ such that the maximum latency to execute $G$ based on $N(G, A_L)$ is minimized.
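The objective above can be illustrated with a small latency calculation. The sketch below computes the critical-path length of a task graph once every task has been placed on a physical core. The function name `schedule_latency`, the `task_to_core` and `core_speed` inputs, and the rule that a task's run time scales as 1/speed are all assumptions made for illustration. It also ignores communication cost and the serialization of tasks that share a core.

```python
from collections import defaultdict, deque


def schedule_latency(exec_time, edges, task_to_core, core_speed):
    """Return the critical-path latency of a task graph (simplified model).

    exec_time:    {task: nominal execution time |v_i|}
    edges:        iterable of (u, v) pairs, meaning v depends on u
    task_to_core: {task: physical core id} (hypothetical input)
    core_speed:   {core id: relative speed, 1.0 = nominal} (assumed model)
    """
    succ = defaultdict(list)
    indeg = {t: 0 for t in exec_time}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Assumption: run time on a core is the nominal time divided by its speed.
    cost = {t: exec_time[t] / core_speed[task_to_core[t]] for t in exec_time}

    start = {t: 0.0 for t in exec_time}
    finish = {}
    ready = deque(t for t, d in indeg.items() if d == 0)
    while ready:  # process tasks in topological order
        u = ready.popleft()
        finish[u] = start[u] + cost[u]
        for v in succ[u]:
            start[v] = max(start[v], finish[u])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    if len(finish) != len(exec_time):
        raise ValueError("task graph contains a cycle")
    return max(finish.values(), default=0.0)
```

With three tasks where `b` and `c` both depend on `a`, and `b` placed on a core running at half speed, the path through `b` dominates the latency.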
Performance evaluation
1. A simple workload/performance matching heuristic.
2. Opportunity cost based workload/performance mapping.
3. Logical/physical topology mapping with communication awareness.
[Algorithm listing; its time complexity expression is missing from the source.]
While Algorithm 1 is fast and intuitive, it has several issues.
Problem 1: Larger workloads are not necessarily located on the critical path.
Problem 2: It does not take their location into consideration.
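A minimal sketch of what a simple workload/performance matching heuristic could look like: sort logical cores by assigned workload, sort physical cores by speed, and pair them in order. The algorithm listing is not reproduced in these slides, so this is an assumed reading. The function name `match_by_workload` and both input dictionaries are hypothetical.

```python
def match_by_workload(workload, speed):
    """Pair the heaviest logical core with the fastest physical core, and so on.

    workload: {logical core: total execution time assigned to it}
    speed:    {physical core: relative speed}; needs at least as many entries
    Both dictionaries are hypothetical inputs chosen for this sketch.
    """
    if len(speed) < len(workload):
        raise ValueError("not enough physical cores")
    # Heaviest workload first, fastest core first.
    logical = sorted(workload, key=workload.get, reverse=True)
    physical = sorted(speed, key=speed.get, reverse=True)
    # zip stops at the shorter list, so spare physical cores stay unused.
    return dict(zip(logical, physical))
```

The sketch looks only at workload totals and core speeds, so it shares the two problems listed above: it has no notion of the critical path or of where a core sits.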
The opportunity cost is the cost of any activity measured in terms of the value of the next best alternative forgone (that is, not chosen). It is the sacrifice related to the second best choice available to a person or group who has picked among several mutually exclusive choices.
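The definition can be sketched in a few lines: given a value for each mutually exclusive alternative, the opportunity cost of a choice is the best value among the alternatives that were not chosen. The function name `opportunity_cost` and the example numbers are hypothetical and are not taken from the paper's example.

```python
def opportunity_cost(profits, chosen):
    """Value of the best alternative that is given up by picking `chosen`.

    profits: {alternative: profit}; the alternatives are mutually exclusive.
    """
    others = [p for alt, p in profits.items() if alt != chosen]
    if not others:
        return 0.0  # nothing is forgone when there is only one option
    return max(others)


# Hypothetical profits of mapping one logical core to three physical cores.
profits = {"p0": 3.0, "p1": 5.0, "p2": 1.0}
cost_of_best = opportunity_cost(profits, "p1")   # next best is p0
cost_of_worst = opportunity_cost(profits, "p2")  # best forgone is p1
```

Here choosing `p1` forgoes a profit of 3.0, while choosing `p2` forgoes a profit of 5.0.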
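Example: map one logical core to one physical core and evaluate the task graph under this mapping. Since the latency of the nominal design is 55, the profit of this decision is defined to be 3.33. For the rest of the alternatives, the best choice is a different physical core, and its profit follows from its latency in the same way. [The core labels, both latency values, and the second profit value are missing from the source.]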
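Definition 1: For a mapping decision, let its profit and its opportunity cost be given. The performance of the decision is then defined in terms of these two quantities. [The symbols and the defining expression are missing from the source.]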
For the example, the resulting values are 1.51, 0, 1.9, and 0.76. According to Definition 1, mapping the logical core with the largest workload assignment to the fastest core does not reduce the critical-path latency and thus has the lowest performance.
In the worst case, the complexity of one pass of the while loop is O(kmn), since m × n different mappings need to be checked, where k is the number of task nodes. The while loop is executed r × c times. Therefore, the overall complexity of Algorithm 2 is O(krcmn).
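The bound can be written out as a product of the three factors. The per-candidate cost of O(k) is not stated directly in the slide; it is the assumption that makes the factors multiply to the stated result.

```latex
\[
T_{\text{Algorithm 2}}
  = \underbrace{r \cdot c}_{\text{loop iterations}}
    \times \underbrace{O(m \cdot n)}_{\text{mappings checked per iteration}}
    \times \underbrace{O(k)}_{\text{assumed cost per check}}
  = O(k\,r\,c\,m\,n)
\]
```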
Neither Algorithm 1 nor Algorithm 2 takes the communication cost into consideration. When the communication cost becomes significant, especially on many-core platforms, the quality of the mappings produced by Algorithm 1 and Algorithm 2 can be severely compromised. We propose an iterative algorithm (Algorithm 3) that improves an existing mapping by taking communication into consideration.
When calculating the latency of the task graph, the communication cost can be incorporated into the performance of a mapping decision. Algorithm 3 iteratively improves the mapping solution until the user-defined improvement threshold (ε) is satisfied.
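The iterative idea can be sketched as a local search that keeps changing the mapping while the latency, communication included, improves by more than a threshold. This is not the paper's Algorithm 3. The pairwise-swap move, the relative reading of ε, and the names `refine_mapping`, `latency`, `eps` and `max_iters` are all assumptions made for illustration. The caller supplies the `latency` function, which is expected to account for communication cost.

```python
from itertools import combinations


def refine_mapping(mapping, latency, eps=0.01, max_iters=200):
    """Iteratively improve `mapping` (logical core -> physical core).

    latency:   callable(mapping) -> float; should include communication cost
    eps:       stop once the best swap improves latency by less than this
               fraction of the current latency (assumed meaning of epsilon)
    max_iters: hard cap on the number of improvement rounds
    The pairwise-swap move is an assumption, not the paper's Algorithm 3.
    """
    current = dict(mapping)  # work on a copy, leave the input untouched
    best = latency(current)
    for _ in range(max_iters):
        best_swap, best_val = None, best
        for a, b in combinations(current, 2):
            current[a], current[b] = current[b], current[a]  # try the swap
            val = latency(current)
            current[a], current[b] = current[b], current[a]  # undo it
            if val < best_val:
                best_swap, best_val = (a, b), val
        # Stop when no swap helps or the gain falls below the threshold.
        if best_swap is None or (best - best_val) < eps * best:
            break
        a, b = best_swap
        current[a], current[b] = current[b], current[a]
        best = best_val
    return current, best
```

The sketch only permutes physical cores that are already in use. A fuller version would also consider moving a logical core onto a spare physical core.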
Experimental results
TGFF is used to randomly generate task graphs (60 nodes). The communication cost of each edge and the execution time of each task are randomly generated. The P&C_OC algorithm is assumed to stop after 200 iterations. Experiments were run on a Windows XP SP3 platform powered by an Intel(R) Core(TM)2 Duo at 2.93 GHz with 3.21 GB of RAM.
SWPM denotes Algorithm 1, P_Only_OC denotes Algorithm 2, and P&C_OC denotes Algorithm 3. We also compare with two previous approaches: the RRCS algorithm and the Hungarian algorithm.
Performance vs. different communication/execution (C/E) ratios. The communication cost of each edge is generated within the interval [a, b]. The execution time of each task node is generated within the interval [c, d]. C/E ratio = [expression missing from the source]
Conclusions
We introduce a framework to maximize the performance of the nominal design, using heuristics based on the concept of opportunity cost. The proposed approach achieves a performance improvement of up to 30%, and 15% on average.