Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms Lin Huang, Feng Yuan and Qiang Xu Reliable Computing Laboratory Department.

Presentation transcript:

Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms Lin Huang, Feng Yuan and Qiang Xu Reliable Computing Laboratory Department of Computer Science & Engineering The Chinese University of Hong Kong DATE’09

Lifetime Reliability of Embedded Multiprocessor Platform
- Multiprocessor system-on-a-chip (MPSoC): platform-based design, hardware/software co-synthesis
- Reliability issues:
  - IC product wear-out leads to lifetime reliability threats: time-dependent dielectric breakdown (TDDB), electromigration (EM), stress migration (SM), negative bias temperature instability (NBTI)
  - Soft errors

Prior Work
- Prior work in reliability-driven task allocation and scheduling: assumes a constant failure rate
- Limitations of thermal-aware task scheduling:
  - Might improve the system's lifetime reliability implicitly
  - Not readily applicable, especially for heterogeneous MPSoCs

Problem Motivation Example
- Electromigration
- Suppose […], and all other parameters are the same
- P1 ages much faster than P2, dominating the MPSoC lifetime
[Figure: MPSoC platform with processors P1 and P2]

Problem Formulation
- Task allocation and scheduling
- Output: periodical schedule
- Aim: to maximize the expected service life (mean time to failure, MTTF) of the MPSoC system under the performance constraint
[Figure: a task graph (T0 to T4) and an MPSoC platform (P1, P2) pass through binding and scheduling to yield a periodical schedule]

Lifetime Reliability Estimation
- Electromigration
- Denote by R(t) the reliability of a single processor at time t
- Expected service life: […]
- Weibull distribution: […]
- Computed by existing hard error models
- Reflects some important factors (e.g., architecture properties)
[Figure: temperature variation example]
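The formulas on this slide did not survive in the transcript. As a minimal sketch, assuming the standard two-parameter Weibull form R(t) = exp(-(t/alpha)^beta) and the usual definition of MTTF as the integral of R(t) from 0 to infinity, the expected service life has the closed form alpha * Gamma(1 + 1/beta). The names `alpha` and `beta` and the numbers used below are illustrative assumptions, not values from the presentation.

```python
import math


def weibull_reliability(t, alpha, beta):
    """Probability that a processor is still working at time t (standard Weibull form)."""
    return math.exp(-((t / alpha) ** beta))


def weibull_mttf(alpha, beta):
    """Closed-form mean time to failure of a Weibull-distributed lifetime."""
    return alpha * math.gamma(1.0 + 1.0 / beta)


def numeric_mttf(alpha, beta, steps=20000):
    """Trapezoidal integration of R(t) over [0, 10*alpha] as a cross-check."""
    upper = 10.0 * alpha
    dt = upper / steps
    total = 0.5 * (weibull_reliability(0.0, alpha, beta)
                   + weibull_reliability(upper, alpha, beta))
    for i in range(1, steps):
        total += weibull_reliability(i * dt, alpha, beta)
    return total * dt


# Hypothetical parameters: scale of 10 (years), shape of 2.
closed_form = weibull_mttf(10.0, 2.0)
cross_check = numeric_mttf(10.0, 2.0)
print(closed_form, cross_check)
```

The numerical integral is there only to show that the closed form matches the definition; in a search loop the closed form alone would be used.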

Main Approach – Simulated Annealing
- Solution representation: (schedule order sequence; resource assignment sequence)
  - For example, (0, 1, 3, 2, 4; P2, P2, P2, P1, P1)
- Schedule order sequence: respects the partial order defined by the task graph
- Every solution corresponds to a feasible schedule
- Schedule reconstruction
[Figure: periodical schedule of T0 to T4 on P1 and P2]
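Schedule reconstruction from the two sequences can be sketched as a simple list-scheduling pass. This is an illustration under stated assumptions, not the authors' procedure: the i-th entry of the resource assignment sequence is taken to be the processor of the i-th task in the schedule order sequence, communication delays are ignored, and the task graph edges and execution times below are hypothetical because the slide figure is not preserved.

```python
def reconstruct_schedule(order, assignment, preds, exec_time):
    """Return {task: (processor, start, finish)} for one period.

    Each task starts as soon as its predecessors have finished and its
    processor is free, following the given schedule order.
    """
    proc_free = {}   # processor -> time at which it becomes idle
    finish = {}      # task -> finish time
    schedule = {}
    for task, proc in zip(order, assignment):
        ready = max((finish[p] for p in preds.get(task, ())), default=0)
        start = max(ready, proc_free.get(proc, 0))
        finish[task] = start + exec_time[(task, proc)]
        proc_free[proc] = finish[task]
        schedule[task] = (proc, start, finish[task])
    return schedule


# Hypothetical task graph: T0->T1, T0->T2, T1->T3, T2->T4, T3->T4.
preds = {1: [0], 2: [0], 3: [1], 4: [2, 3]}
# Hypothetical execution times per (task, processor).
on_p1 = {0: 2, 1: 3, 2: 4, 3: 2, 4: 1}
on_p2 = {0: 3, 1: 4, 2: 5, 3: 3, 4: 2}
exec_time = {(t, "P1"): c for t, c in on_p1.items()}
exec_time.update({(t, "P2"): c for t, c in on_p2.items()})

# The solution from the slide: (0, 1, 3, 2, 4; P2, P2, P2, P1, P1).
schedule = reconstruct_schedule([0, 1, 3, 2, 4],
                                ["P2", "P2", "P2", "P1", "P1"],
                                preds, exec_time)
for task in sorted(schedule):
    print(task, schedule[task])
```

Because the order sequence is a topological order of the task graph, every predecessor's finish time is known when a task is placed, which is why every solution maps to a feasible schedule.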

Main Approach – Simulated Annealing
- Transforms of the directed acyclic graph: expanded task graph, undirected complement graph
- Lemma: Given a valid schedule order, swapping two adjacent nodes leads to another valid schedule order, provided there is an edge between these two nodes in the complement graph
[Figure: task graph, expanded task graph, and complement graph over T0 to T4]
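The lemma can be checked mechanically. The sketch below assumes that the expanded task graph is the transitive closure of the task graph and that the complement graph joins two tasks exactly when neither precedes the other in that closure; the transcript does not define either graph, so both readings are assumptions, and the edge list is hypothetical.

```python
from itertools import combinations


def transitive_closure(n, edges):
    """reach[i][j] is True when task i must precede task j (Warshall's algorithm)."""
    reach = [[False] * n for _ in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def complement_edges(n, edges):
    """Unordered task pairs with no precedence relation in either direction."""
    reach = transitive_closure(n, edges)
    return {frozenset((i, j)) for i, j in combinations(range(n), 2)
            if not reach[i][j] and not reach[j][i]}


def can_swap(order, pos, comp):
    """True if the tasks at positions pos and pos+1 may be exchanged."""
    return frozenset((order[pos], order[pos + 1])) in comp


# Hypothetical task graph: T0->T1, T0->T2, T1->T3, T2->T4, T3->T4.
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4)]
comp = complement_edges(5, edges)
order = [0, 1, 3, 2, 4]
print(sorted(tuple(sorted(e)) for e in comp))
print(can_swap(order, 2, comp), can_swap(order, 0, comp))
```

For this graph only the pairs (T1, T2) and (T2, T3) are unordered, so T3 and T2 may trade places in the order above while T0 and T1 may not.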

Main Approach – Simulated Annealing
- Theorem: Starting from a valid schedule order, we can reach any other valid schedule order after a finite number of adjacent swaps
[Figure: example on the task graph, expanded task graph, and complement graph over T0 to T4]

Main Approach – Simulated Annealing
- Moves
  - M1: Swap two adjacent nodes in both the schedule order sequence and the resource assignment sequence, if there is an edge between these two nodes in the complement graph
  - M2: Swap two adjacent nodes in the resource assignment sequence
  - M3: Change the resource assignment of a task
[Figure: task graph, expanded task graph, and complement graph over T0 to T4]
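The three moves can be sketched as small functions on the two sequences. The assumptions here are that the resource assignment sequence is positional (entry i belongs to the task at position i of the schedule order) and that the complement graph is given as a set of unordered pairs; the set `COMP` below is a hypothetical example, not data from the slides.

```python
# Hypothetical complement graph: only T1/T2 and T2/T3 are unordered.
COMP = {frozenset((1, 2)), frozenset((2, 3))}


def move_m1(order, assign, pos, comp=COMP):
    """Swap positions pos and pos+1 in both sequences, if the complement graph allows it."""
    if frozenset((order[pos], order[pos + 1])) not in comp:
        return None  # the swap would violate a precedence constraint
    new_order, new_assign = order[:], assign[:]
    new_order[pos], new_order[pos + 1] = new_order[pos + 1], new_order[pos]
    new_assign[pos], new_assign[pos + 1] = new_assign[pos + 1], new_assign[pos]
    return new_order, new_assign


def move_m2(order, assign, pos):
    """Swap positions pos and pos+1 in the resource assignment sequence only."""
    new_assign = assign[:]
    new_assign[pos], new_assign[pos + 1] = new_assign[pos + 1], new_assign[pos]
    return order[:], new_assign


def move_m3(order, assign, pos, proc):
    """Rebind the task at position pos to processor proc."""
    new_assign = assign[:]
    new_assign[pos] = proc
    return order[:], new_assign


order = [0, 1, 3, 2, 4]
assign = ["P2", "P2", "P2", "P1", "P1"]
print(move_m1(order, assign, 2))
print(move_m2(order, assign, 2))
print(move_m3(order, assign, 0, "P1"))
```

M1 changes the order while keeping each task on its processor, whereas M2 and M3 change the binding and leave the order alone, so the schedule order stays valid after any of the three.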

Main Approach – Simulated Annealing
- Three moves are defined, so that starting from a valid schedule order A, we can reach any other valid schedule order B after a finite number of adjacent swaps
- Cost function: […]
  - First term: guarantees that a schedule meets all tasks' deadlines (its weight is significantly large)
  - Second term: indicates the system lifetime
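The cost formula itself is not preserved in the transcript. One way to realize the two-term description is a large penalty on deadline violations minus the MTTF, so that minimizing the cost first removes violations and then maximizes lifetime. The function below is a hypothetical stand-in, not the authors' formula; `penalty`, `finish_times`, and `deadlines` are illustrative names.

```python
def schedule_cost(finish_times, deadlines, mttf, penalty=1e6):
    """Cost to minimize: heavy penalty for missed deadlines, reward for long MTTF.

    finish_times and deadlines map task -> time; mttf is the estimated
    system lifetime of the candidate schedule.
    """
    violation = sum(max(0.0, finish_times[t] - deadlines[t]) for t in deadlines)
    return penalty * violation - mttf


# Hypothetical candidates with a deadline of 12 on the sink task T4.
feasible = schedule_cost({4: 11}, {4: 12}, mttf=8.0)
infeasible = schedule_cost({4: 13}, {4: 12}, mttf=100.0)
print(feasible, infeasible)
```

With a sufficiently large penalty, any schedule that misses a deadline costs more than any schedule that meets all of them, whatever their lifetimes.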

Main Approach – Simulated Annealing
- Key problem: computation time
- Sources of time overhead:
  - The temperature simulator runs EVERY TIME we reach a new solution: the simulator is called 3×10^5 times
  - Every run traces the temperature variation over the entire service life, which is in the range of years
  - Accurate calculation requires a fine-grained variation trace file, because temperature changes significantly within a very short time
- An efficient cost computation strategy is essential!
SA parameters: initial temperature 10^2; end temperature 10^-5; cooling rate 0.95; iteration 10^3
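The figure of roughly 3×10^5 simulator calls is consistent with the listed SA parameters if one assumes geometric cooling (the annealing temperature is multiplied by the cooling rate after each level) and reads "iteration" as the number of candidate solutions evaluated per temperature level. Both readings are assumptions; the sketch below only does the arithmetic.

```python
def annealing_evaluations(t_init, t_end, cooling, iters_per_level):
    """Count cost evaluations for a geometric cooling schedule."""
    levels = 0
    temperature = t_init
    while temperature > t_end:
        levels += 1
        temperature *= cooling
    return levels * iters_per_level


# Parameters from the slide: 10^2, 10^-5, 0.95, 10^3.
calls = annealing_evaluations(1e2, 1e-5, 0.95, 1000)
print(calls)
```

Under these assumptions the loop visits 315 temperature levels, giving 315,000 evaluations, which matches the order of magnitude quoted on the slide.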

Revisit System Lifetime Reliability Estimation – Speedup I
- It would be better if we could compute the MTTF by tracing the temperature variation of only one period

Revisit System Lifetime Reliability Estimation – Speedup I
- A subdivision of time: […]

Revisit System Lifetime Reliability Estimation – Speedup I
- Given […]
- Aging effect in one period: […]
- Property: this quantity does not vary from period to period
- This property enables us to trace the temperature variation of only ONE period

Revisit System Lifetime Reliability Estimation – Speedup I
- The expected service life of one processor is […]
- Provided there are no redundant processors in the system, the expected service life of the entire system is […]
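The expressions on these slides are missing from the transcript, so the following is a hedged reconstruction under explicit assumptions rather than the authors' derivation. Assume a Weibull lifetime whose scale parameter alpha depends on temperature, and that aging accumulates additively, so that after a trace of intervals (dt_i, alpha_i) the reliability is exp(-(sum of dt_i/alpha_i)^beta). If the per-period aging A is the same in every period of length t_p, the processor behaves like a Weibull unit with effective scale t_p/A. For a system without redundancy (a series system) whose processors share the same beta, the system lifetime is again Weibull. All names and numbers below are illustrative.

```python
import math


def aging_per_period(trace):
    """Accumulated aging over one period; trace is a list of (duration, alpha) pairs."""
    return sum(dt / alpha for dt, alpha in trace)


def period_length(trace):
    return sum(dt for dt, _ in trace)


def processor_mttf(trace, beta):
    """MTTF of one processor from a single-period trace."""
    alpha_eff = period_length(trace) / aging_per_period(trace)
    return alpha_eff * math.gamma(1.0 + 1.0 / beta)


def system_mttf(traces, beta):
    """MTTF of a series system: the first processor failure ends the service life."""
    rate = sum((aging_per_period(tr) / period_length(tr)) ** beta for tr in traces)
    return math.gamma(1.0 + 1.0 / beta) / rate ** (1.0 / beta)


# Hypothetical single-period traces; durations and alpha use the same time unit.
cool = [(1.0, 10.0), (3.0, 10.0)]   # constant scale of 10 throughout the period
hot = [(1.0, 5.0), (3.0, 10.0)]     # hotter first interval halves the scale
print(processor_mttf(cool, 2.0), processor_mttf(hot, 2.0))
print(system_mttf([cool, hot], 2.0))
```

The point of the sketch is that only one period of temperature data is needed: the trace enters the result only through its per-period aging and its length.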

Revisit System Lifetime Reliability Estimation – Speedup II
- Given […]
- Instead of computing the aging effect in every period, we propose to compute the aging effect of several periods at one time

Revisit System Lifetime Reliability Estimation – Speedup III
- Accurate calculation requires setting the length of the time intervals to a very small value
- Use the steady temperature rather than the accurate temporal temperature
[Figure: temperature variation example alongside the task schedule]

Revisit System Lifetime Reliability Estimation – Speedup IV
- Otherwise, the temperature simulator must run every time we reach a new solution
- There can be at most […] kinds of processor usage combinations in task schedules
- Given […] = 3, […] = 4, we need only 255 pre-computations, each for a steady temperature
- Estimate the processors' temperatures for the various processor usage combinations in the pre-calculation phase only
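The symbols in the count are missing, but 255 equals (3 + 1)^4 - 1, which fits the reading that each of 4 processors is either idle or running a task of one of 3 power consumption types, with the all-idle case excluded. That interpretation is an inference from the numbers, not something the transcript states. A minimal enumeration under that assumption:

```python
from itertools import product


def usage_combinations(num_procs, num_types):
    """All processor usage combinations with at least one busy processor.

    Each entry is 0 (idle) or 1..num_types (power type of the running task).
    """
    return [combo for combo in product(range(num_types + 1), repeat=num_procs)
            if any(combo)]


combos = usage_combinations(num_procs=4, num_types=3)
print(len(combos))
print(combos[:3])
```

Each combination would be simulated once for its steady temperature and cached, so the search loop never has to call the temperature simulator again.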

Revisit System Lifetime Reliability Estimation – Speedup IV
- Time slot: […]
- The set of processors in use: […]
- The power consumption of the tasks running on these processors: […]
- Categorize the tasks into […] types according to power consumption
- E.g., […], indexed by the processors in use and the task power consumption type

Revisit System Lifetime Reliability Estimation – Speedup IV
- Pre-calculate the steady temperature of processor […] in time slot […]
- The aging effect in unit time in this case is therefore […]
- The aging effect of P1 in this schedule in a period is […]
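With the steady-state values pre-calculated, the per-period aging of a processor reduces to a table lookup and a weighted sum over time slots. The sketch below assumes the schedule has already been cut into slots during which the set of busy processors and their task power types stay fixed, and that a table maps each usage combination to a per-unit-time aging rate for every processor. The table contents and slot lengths are hypothetical.

```python
def aging_in_period(slots, rate_table, proc_index):
    """Aging of one processor over a period.

    slots: list of (slot_length, usage_combination)
    rate_table: usage_combination -> tuple of per-unit-time aging rates,
                one entry per processor (pre-calculated from steady temperatures)
    """
    return sum(length * rate_table[combo][proc_index] for length, combo in slots)


# Hypothetical two-processor example; a combination is (type on P1, type on P2),
# with 0 meaning idle.
rate_table = {
    (1, 0): (0.010, 0.002),
    (1, 2): (0.012, 0.015),
    (0, 2): (0.003, 0.014),
}
slots = [(2.0, (1, 0)), (3.0, (1, 2)), (1.0, (0, 2))]

aging_p1 = aging_in_period(slots, rate_table, 0)
aging_p2 = aging_in_period(slots, rate_table, 1)
print(aging_p1, aging_p2)
```

An idle processor still ages in this sketch, at a lower rate, because its steady temperature is affected by its busy neighbours.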

Revisit System Lifetime Reliability Estimation – Summary
- A summary of the speedup techniques:
  - Rewrite the MTTF expression in terms of the aging effect in one period
  - Compute the aging effect of several periods at one time
  - Approximate the aging effect in one period based on the task changes, using the steady temperature
  - Call the temperature simulator in the pre-calculation phase only
- The time spent on pre-calculation can be reduced even further

Experimental Setup
- Random task graphs generated by TGFF; task numbers range from 20 to 260
- Hypothetical MPSoC platforms; processor core numbers range from 2 to 8; homogeneous / heterogeneous
- The electromigration model in [Goel-IEEEPress07] is taken as an example; note that our model also applies to other failure mechanisms
- Our method is compared with the thermal-aware task scheduling algorithm proposed in [Xie-JVLSISP06]

Accuracy Comparison between approximated MTTF and accurate value

Lifetime Reliability of Various Platforms with Various Task Graphs
[Table: platform description (M-PE, Co-PE), task description (Task, Edge), deadline, thermal-aware result, and simulated annealing MTTF and Δ(%) at 0% DR, 5% DR, and 10% DR]
- Δ: difference ratio between the MTTF of simulated annealing and that of thermal-aware scheduling
- DR: deadline relaxation

Lifetime Reliability of 8-Processor Platforms
[Table: for each task graph, the thermal-aware MTTF and the simulated annealing MTTF and Δ(%) at each DR (%), on the 8-core homogeneous and 8-core heterogeneous platforms]
- Task #: 101, Edge #: 142. Homogeneous: deadline 1059, MTTF […]. Heterogeneous: deadline 809, MTTF […].
- Task #: 131, Edge #: 190. Homogeneous: deadline 1227, MTTF […]. Heterogeneous: deadline 984, MTTF […].
- Task #: 251, Edge #: 366. Homogeneous: deadline 2014, MTTF […]. Heterogeneous: deadline 1693, MTTF […].

Efficiency
- The simulated annealing process requires […] s of CPU time on an Intel(R) Core(TM) 2 CPU at 2.13GHz for each case
  - 4 processors, 49 tasks: 84s
  - 8 processors, 101 tasks: 158s
- The CPU time spent on pre-calculation ranges from 3s to 160s

Conclusion
- Technology advancement has had an adverse impact on the lifetime reliability of MPSoC embedded systems
- Prior work on task allocation and scheduling does not explicitly take wear-out failure into account
- We propose an analytical model to estimate the lifetime reliability of multiprocessor platforms under periodical tasks
- We present a novel lifetime reliability-aware algorithm based on the simulated annealing technique
- We propose several speedup techniques to simplify the design space exploration process with satisfactory solution quality
- Experimental results demonstrate the effectiveness of the approach

Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms Thank you for your attention !