ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, Pages 408-415 Tianyi Wang, Gang Quan, Shangping.

ICPADS '12: Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, pages 408-415
Tianyi Wang, Gang Quan, Shangping Ren, Meikang Qiu
Presented by 曾冠維

• Introduction
• Preliminary
• Performance evaluation
• Experimental results
• Conclusions

• Performance variation across IC chips can cause significant discrepancies.
• One major problem caused by manufacturing variations is reduced fabrication yield.

• Therefore, micro-architecture-level and core-level redundancies are employed to improve the fabrication yield.
• According to "Exploiting micro-architectural redundancy for defect tolerance", core-level redundancy achieves better yield.

• Another problem caused by manufacturing variations is performance variation.

• How can the total schedule length of a task graph be reduced when realizing its nominal design?
• By developing a performance metric based on opportunity cost.

Use the Row Rippling Column Stealing (RRCS) algorithm to replace faulty cores with redundant cores.

• The application is a task graph $G = \{V, E\}$ with task nodes $V = \{v_1, v_2, \ldots, v_k\}$.
• $E = \{e(i,j) = (v_i, v_j) \mid \text{task node } v_i \text{ communicates with task node } v_j\}$.
• $|v_i|$ represents the execution time of task node $v_i$.
• The logical architecture, denoted $L$, consists of $r \times c$ cores: $L = \{l_{i,j} \mid i = 0, \ldots, r-1;\ j = 0, \ldots, c-1\}$.

• The nominal design of application $G$ on the logical architecture $L$ is denoted $N(G, L)$.
• The physical architecture, denoted $P$, consists of $m \times n$ cores: $P = \{p_{i,j} \mid i = 0, \ldots, m-1;\ j = 0, \ldots, n-1\}$.

• Problem: Given an application $G$; a logical architecture $L$; the nominal design of $G$ on $L$, i.e. $N(G, L)$; and the physical architecture $P$.

• Find the mapping $M = \{l_{i,j} \to p_{x,y} \mid i = 0, \ldots, r-1;\ j = 0, \ldots, c-1;\ 0 \le x \le m-1;\ 0 \le y \le n-1\}$ from logical cores to physical cores such that the maximum latency to execute $G$ based on $N(G, L)$ is minimized.
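
The objective above can be illustrated with a minimal Python sketch that evaluates the latency (longest-path length) of a task graph once every logical core has been mapped to a physical core. The speed-factor model (execution time divided by a per-core relative speed), the function name `schedule_latency`, and the sample data are assumptions made for illustration and are not taken from the paper. Communication cost and serialization of tasks sharing a core are ignored here.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def schedule_latency(exec_time, edges, task_core, mapping, speed):
    """Longest-path latency of a task graph under a logical->physical mapping.

    exec_time: {task: nominal execution time |v_i|}
    edges:     [(u, v)] meaning task u must finish before task v starts
    task_core: {task: logical core (i, j)} taken from the nominal design
    mapping:   {logical core (i, j): physical core (x, y)}
    speed:     {physical core: relative speed, 1.0 = nominal}  (assumed model)
    """
    preds = {t: set() for t in exec_time}
    for u, v in edges:
        preds[v].add(u)
    finish = {}
    # Visit tasks so that every predecessor is finished before its successors.
    for t in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[t]), default=0.0)
        core = mapping[task_core[t]]
        finish[t] = start + exec_time[t] / speed[core]
    return max(finish.values())

# Hypothetical three-task example on a 1x2 logical architecture.
exec_time = {"a": 10, "b": 20, "c": 5}
edges = [("a", "b"), ("a", "c")]
task_core = {"a": (0, 0), "b": (0, 1), "c": (0, 0)}
mapping = {(0, 0): (0, 0), (0, 1): (1, 1)}
speed = {(0, 0): 1.0, (1, 1): 2.0}
latency = schedule_latency(exec_time, edges, task_core, mapping, speed)
```

In this toy instance the path a -> b dominates: task a finishes at 10, and b needs 20 / 2.0 = 10 more time units on the faster core.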

1. A simple workload/performance matching heuristic.
2. Opportunity cost based workload/performance mapping.
3. Logical/physical topology mapping with communication awareness.

[Algorithm listing; the time complexity expression is not preserved]

• While Algorithm 1 is fast and intuitive, it has several issues.
• Problem 1: Larger workloads do not necessarily lie on the critical path.
• Problem 2: It does not take the locations of the workloads into consideration.
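
A workload/performance matching heuristic of this kind can be sketched as follows. This is one plausible reading (sort logical cores by workload, sort physical cores by speed, pair them in order), not the paper's listing, which is not preserved here. The function name `match_by_workload` and the sample numbers are hypothetical.

```python
def match_by_workload(workload, speed):
    """Pair the heaviest logical cores with the fastest physical cores.

    workload: {logical core: total execution time assigned to it}
    speed:    {physical core: relative speed}; spare cores are allowed
    """
    if len(speed) < len(workload):
        raise ValueError("not enough physical cores")
    # Heaviest workload first, fastest core first, then pair them up in order.
    logical = sorted(workload, key=workload.get, reverse=True)
    physical = sorted(speed, key=speed.get, reverse=True)
    return dict(zip(logical, physical))

# Hypothetical 1x2 logical architecture on three available physical cores.
workload = {(0, 0): 30, (0, 1): 50}
speed = {(0, 0): 0.9, (0, 1): 1.1, (1, 0): 1.0}
assignment = match_by_workload(workload, speed)
```

The sketch makes both listed problems visible: the pairing looks only at per-core totals, so it cannot tell whether a heavy workload is on the critical path, and it never consults where the cores sit in the topology.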

• The opportunity cost of an activity is its cost measured in terms of the value of the next best alternative forgone (that is, the one not chosen).
• It is the sacrifice related to the second best choice available to someone, or to a group, who has picked among several mutually exclusive choices.

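• Example: mapping [logical core not preserved] to [physical core not preserved].
• The task graph of this mapping is [figure not preserved].
• Since the latency of the nominal design is 55, the profit of the decision is defined as [expression not preserved] = 3.33.
• Among the remaining alternatives, the best choice is to map it to [physical core not preserved], with a latency of [value not preserved]. The profit is [value not preserved].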

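• Definition 1: For a mapping decision [symbol not preserved], let its profit be denoted as [symbol not preserved] and its opportunity cost as [symbol not preserved]. The performance of the decision is then defined as [formula not preserved].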

• For the example, the four values obtained are 1.51, 0, 1.9, and 0.76 (the symbols they belong to are not preserved).
• According to Definition 1, mapping the logical core with the largest workload to the fastest core does not reduce the critical-path latency and thus has the lowest performance.
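
The idea of scoring a decision by profit and opportunity cost can be sketched as below. Two things are assumptions, because the slide's formulas are not preserved: profit is taken as the nominal latency minus the latency after the decision, and performance is taken as profit minus opportunity cost. The function name `decision_performance` and the candidate latencies are hypothetical; only the nominal latency of 55 comes from the example above.

```python
def decision_performance(nominal_latency, latency_by_target):
    """Score the best target for one logical core.

    nominal_latency:   latency of the nominal design
    latency_by_target: {physical core: latency if the core is mapped there}
    Returns (best_target, profit, opportunity_cost, performance).
    ASSUMPTION: performance = profit - opportunity cost.
    """
    # ASSUMPTION: profit = latency saved relative to the nominal design.
    profits = {t: nominal_latency - lat for t, lat in latency_by_target.items()}
    ranked = sorted(profits, key=profits.get, reverse=True)
    best = ranked[0]
    # Opportunity cost: profit of the next best alternative that is forgone.
    opportunity_cost = profits[ranked[1]] if len(ranked) > 1 else 0.0
    return best, profits[best], opportunity_cost, profits[best] - opportunity_cost

# Hypothetical latencies for three candidate physical cores.
candidates = {"p0": 50.0, "p1": 52.0, "p2": 56.0}
best, profit, opp_cost, perf = decision_performance(55.0, candidates)
```

Under these assumptions a decision scores highly only when it is clearly better than its runner-up, which is why a mapping that leaves the critical path unchanged earns no credit.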

• In the worst case, the complexity of the while loop is $O(kmn)$, since $m \times n$ different mappings need to be checked, where $k$ is the number of task nodes.
• The while loop is executed $r \times c$ times.
• Therefore, the overall complexity of Algorithm 2 is $O(krcmn)$.
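
The count can be written out as a short derivation. The per-mapping cost of $O(k)$ is an assumption implied by the stated $O(kmn)$ bound rather than something the slide spells out.

```latex
\begin{align*}
T_{\text{while}} &= \underbrace{mn}_{\text{candidate mappings}} \cdot \underbrace{O(k)}_{\text{cost per check (assumed)}} = O(kmn), \\
T_{\text{Algorithm 2}} &= \underbrace{rc}_{\text{loop executions}} \cdot\, T_{\text{while}} = O(krcmn).
\end{align*}
```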

• Neither Algorithm 1 nor Algorithm 2 takes the communication cost into consideration.
• When the communication cost becomes significant, especially on many-core platforms, the quality of the mappings produced by Algorithm 1 and Algorithm 2 can be severely compromised.
• An iterative algorithm (Algorithm 3) is proposed to improve existing mapping results by taking communication into consideration.

• When calculating the latency of the task graph, the communication cost can be incorporated into the performance of a mapping decision.
• Algorithm 3 iteratively improves the mapping solution until the user-defined improvement threshold (ε) is satisfied.
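
An iterative refinement with a stopping threshold can be sketched as follows. The pairwise-swap neighbourhood, the function name `improve_mapping`, and the toy latency function are assumptions for illustration; the listing of Algorithm 3 is not preserved, so this is not the paper's procedure. A real `latency` callable would include the communication cost.

```python
from itertools import combinations

def improve_mapping(mapping, latency, eps=1e-3, max_iters=200):
    """Greedy pairwise-swap refinement of a logical->physical mapping.

    mapping: {logical core: physical core}
    latency: callable(mapping) -> latency of the task graph
    eps:     stop once the best swap improves latency by less than eps
    """
    current = dict(mapping)
    best_latency = latency(current)
    for _ in range(max_iters):
        best_swap, best_gain = None, 0.0
        for a, b in combinations(current, 2):
            current[a], current[b] = current[b], current[a]  # trial swap
            gain = best_latency - latency(current)
            current[a], current[b] = current[b], current[a]  # undo trial swap
            if gain > best_gain:
                best_swap, best_gain = (a, b), gain
        if best_swap is None or best_gain < eps:
            break  # no swap helps enough: threshold satisfied
        a, b = best_swap
        current[a], current[b] = current[b], current[a]
        best_latency = latency(current)
    return current, best_latency

# Toy latency: slowest core determines the latency (no communication term).
workload = {"l0": 40, "l1": 10}
speed = {"p0": 1.0, "p1": 2.0}
toy_latency = lambda m: max(workload[l] / speed[m[l]] for l in m)
refined, refined_latency = improve_mapping({"l0": "p0", "l1": "p1"}, toy_latency)
```

The sketch only exchanges physical cores that are already in use; considering spare cores would require enlarging the neighbourhood.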

• TGFF is used to randomly generate task graphs (60 nodes).
• The communication cost of each edge and the execution time of each task are randomly generated.
• The P&C_OC algorithm is assumed to stop after 200 iterations.
• Experiments were run on a Windows XP SP3 platform with an Intel(R) Core(TM)2 Duo 2.93 GHz processor and 3.21 GB of RAM.

• SWPM denotes Algorithm 1.
• P_Only_OC denotes Algorithm 2.
• P&C_OC denotes Algorithm 3.
• The results are also compared with two previous approaches: the RRCS algorithm and the Hungarian algorithm.

• Performance vs. different communication/execution (C/E) ratios.
• The communication cost is generated within the interval [a, b].
• The execution time of each task node is generated within the interval [c, d].
• C/E ratio = [formula not preserved]

• A framework is introduced to maximize the performance of the nominal design.
• The heuristics are based on the concept of opportunity cost.
• The proposed approach achieves up to 30%, and on average 15%, performance improvement.
