Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms. Lin Huang, Feng Yuan and Qiang Xu. Reliable Computing Laboratory, Department of Computer Science & Engineering, The Chinese University of Hong Kong. DATE’09.
Lifetime Reliability of Embedded Multiprocessor Platform. Multiprocessor system-on-a-chip (MPSoC): platform-based design; hardware/software co-synthesis. Reliability issues: IC product wear-out, i.e. lifetime reliability threats such as time-dependent dielectric breakdown (TDDB), electromigration (EM), stress migration (SM), and negative bias temperature instability (NBTI); soft errors.
Prior Work. Prior work in reliability-driven task allocation and scheduling assumes a constant failure rate. Limitation of thermal-aware task scheduling: it might improve the system’s lifetime reliability implicitly, but it is not readily applicable, especially for heterogeneous MPSoCs.
Problem Motivation Example. Electromigration: suppose [expression not preserved], and all other parameters are the same. Then P1 ages much faster than P2, dominating the MPSoC lifetime. [Figure: MPSoC platform with processors P1 and P2]
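The motivation can be illustrated with Black's equation, a widely used electromigration model in which MTTF falls exponentially as temperature rises. The sketch below is only an illustration: the slide's actual condition is not preserved, so the temperature gap between P1 and P2, the function name `em_mttf`, and every parameter value are assumptions.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def em_mttf(temp_kelvin, current_density=1.0, prefactor=1.0,
            exponent=2.0, activation_ev=0.9):
    """Black's equation: MTTF = A * J^(-n) * exp(Ea / (k * T)).

    All default parameter values are hypothetical placeholders.
    """
    return (prefactor * current_density ** (-exponent)
            * math.exp(activation_ev / (BOLTZMANN_EV * temp_kelvin)))


# Hypothetical steady temperatures: P1 runs 20 K hotter than P2,
# with all other parameters identical.
mttf_p1 = em_mttf(370.0)
mttf_p2 = em_mttf(350.0)

# ratio > 1 means the hotter core P1 is expected to wear out first.
ratio = mttf_p2 / mttf_p1
print(f"P2 is expected to outlive P1 by a factor of {ratio:.2f}")
```

Because the platform fails when its first processor fails (absent redundancy), the faster-aging core bounds the system lifetime, which is why balancing wear across cores matters.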
Problem Formulation. Task allocation and scheduling: the task graph is bound to and scheduled on the MPSoC platform, and the output is a periodical schedule. Aim: to maximize the expected service life (mean time to failure, MTTF) of the MPSoC system under the performance constraint. [Figure: task graph (T0 to T4), MPSoC platform (P1, P2), and the periodical schedule produced by binding and scheduling]
Lifetime Reliability Estimation. Electromigration is taken as the example failure mechanism. Denote by R(t) the reliability of a single processor at time t. Expected service life: [expression not preserved]. Weibull distribution: [expression not preserved]. [Figure: temperature variation example] Computed by existing hard error models; reflects some important factors (e.g., architecture properties).
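As a hedged illustration of the quantities named on this slide: for a Weibull reliability function R(t) = exp(-(t/α)^β), the expected service life is the integral of R(t) from 0 to infinity, which has the closed form α·Γ(1 + 1/β). The slide's own expressions are not preserved, so the symbols `alpha` and `beta` and the constant-temperature setting below are assumptions.

```python
import math


def weibull_reliability(t, alpha, beta):
    """R(t) for a Weibull distribution with scale alpha and slope beta."""
    return math.exp(-((t / alpha) ** beta))


def mttf_closed_form(alpha, beta):
    """MTTF = integral of R(t) dt over [0, inf) = alpha * Gamma(1 + 1/beta)."""
    return alpha * math.gamma(1.0 + 1.0 / beta)


def mttf_numeric(alpha, beta, steps=200_000, horizon_factor=20.0):
    """Trapezoidal integration of R(t), truncated at horizon_factor * alpha."""
    horizon = horizon_factor * alpha
    dt = horizon / steps
    total = 0.5 * (weibull_reliability(0.0, alpha, beta)
                   + weibull_reliability(horizon, alpha, beta))
    for k in range(1, steps):
        total += weibull_reliability(k * dt, alpha, beta)
    return total * dt


# Hypothetical parameters (e.g. alpha in years).
print(mttf_closed_form(10.0, 2.0), mttf_numeric(10.0, 2.0))
```

The numerical integral is included only to confirm the closed form; in a scheduler one would use the closed form directly.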
Main Approach – Simulated Annealing. Solution representation: (schedule order sequence; resource assignment sequence), for example (0, 1, 3, 2, 4; P2, P2, P2, P1, P1). The schedule order sequence must respect the partial order defined by the task graph. Every solution corresponds to a feasible schedule, obtained by schedule reconstruction. [Figure: periodical schedule of T0 to T4 on P1 and P2]
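Schedule reconstruction from this representation can be sketched as a single pass over the order sequence. The sketch assumes the resource assignment sequence is positional (its i-th entry is the processor of the i-th task in the schedule order); the task graph edges and execution times are hypothetical, since the slide's figure is not preserved.

```python
def reconstruct_schedule(order, assignment, exec_time, preds):
    """Return (task, processor, start, end) tuples for one period.

    Each task starts as soon as its predecessors have finished and its
    processor is free; tasks are placed in the given schedule order.
    """
    proc_free = {}   # processor -> time at which it becomes idle
    finish = {}      # task -> finish time
    schedule = []
    for task, proc in zip(order, assignment):
        ready = max((finish[p] for p in preds.get(task, ())), default=0)
        start = max(ready, proc_free.get(proc, 0))
        end = start + exec_time[task]
        finish[task] = end
        proc_free[proc] = end
        schedule.append((task, proc, start, end))
    return schedule


# Hypothetical task graph and execution times.
preds = {1: [0], 2: [0], 3: [1], 4: [2, 3]}
exec_time = {0: 2, 1: 3, 2: 4, 3: 1, 4: 2}

schedule = reconstruct_schedule(
    order=[0, 1, 3, 2, 4],
    assignment=["P2", "P2", "P2", "P1", "P1"],
    exec_time=exec_time,
    preds=preds,
)
makespan = max(end for _, _, _, end in schedule)
```

Since the order respects the task graph, every predecessor's finish time is already known when a task is placed, which is why each solution maps to a feasible schedule.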
Main Approach – Simulated Annealing. Transforms of the directed acyclic graph: expanded task graph; undirected complement graph. Lemma: given a valid schedule order, swapping two adjacent nodes leads to another valid schedule order, provided there is an edge between these two nodes in the complement graph. [Figure: task graph, expanded task graph, and complement graph over T0 to T4]
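One way to read the two transforms is sketched below, under the assumption that the expanded task graph is the transitive closure of the task graph and that the complement graph joins every pair of tasks with no precedence relation in either direction. The edge list is hypothetical and the function names are illustrative.

```python
from itertools import combinations


def transitive_closure(n, edges):
    """Expanded task graph: (u, v) whenever v is reachable from u."""
    succ = {u: set() for u in range(n)}
    for u, v in edges:
        succ[u].add(v)
    closure = set()
    for u in range(n):
        stack, seen = list(succ[u]), set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ[v])
        closure |= {(u, v) for v in seen}
    return closure


def complement_graph(n, edges):
    """Undirected edges between tasks that are unordered by precedence."""
    closure = transitive_closure(n, edges)
    return {frozenset((u, v)) for u, v in combinations(range(n), 2)
            if (u, v) not in closure and (v, u) not in closure}


def swap_adjacent(order, i, complement):
    """Swap positions i and i+1 if the lemma allows it, else return None."""
    a, b = order[i], order[i + 1]
    if frozenset((a, b)) not in complement:
        return None
    new_order = list(order)
    new_order[i], new_order[i + 1] = b, a
    return new_order


EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4)]  # hypothetical task graph
complement = complement_graph(5, EDGES)
```

With these edges, `swap_adjacent([0, 1, 3, 2, 4], 2, complement)` exchanges T3 and T2, while a swap at index 1 (T1 and T3) is rejected because T1 precedes T3.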
Main Approach – Simulated Annealing. Theorem: starting from a valid schedule order, we are able to reach any other valid schedule order after a finite number of adjacent swaps. For example: [Figure: task graph, expanded task graph, and complement graph over T0 to T4]
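The theorem can be checked by brute force on a small instance. The sketch below uses a hypothetical five-task graph, enumerates all valid schedule orders, and confirms that a search using only validity-preserving adjacent swaps reaches every one of them. For two adjacent tasks in a valid order, "the swapped order is still valid" is equivalent to "the two tasks have no precedence relation", so the validity test stands in for the complement-graph test.

```python
from itertools import permutations

EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4)]  # hypothetical task graph
NUM_TASKS = 5


def is_valid(order):
    """True if every edge u -> v has u placed before v."""
    pos = {task: i for i, task in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in EDGES)


def neighbours(order):
    """Valid orders obtained by one adjacent swap."""
    for i in range(len(order) - 1):
        cand = list(order)
        cand[i], cand[i + 1] = cand[i + 1], cand[i]
        if is_valid(cand):
            yield tuple(cand)


def reachable_from(start):
    """All orders reachable from start through valid adjacent swaps."""
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for nxt in neighbours(current):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


all_valid = {p for p in permutations(range(NUM_TASKS)) if is_valid(p)}
reached = reachable_from((0, 1, 3, 2, 4))
print(reached == all_valid)
```

This is a check on one example, not a proof; it is exponential in the number of tasks and only practical for tiny graphs.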
Main Approach – Simulated Annealing. Moves. M1: swap two adjacent nodes in both the schedule order sequence and the resource assignment sequence, if there is an edge between these two nodes in the complement graph. M2: swap two adjacent nodes in the resource assignment sequence. M3: change the resource assignment of a task. [Figure: task graph, expanded task graph, and complement graph over T0 to T4]
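The three moves can be sketched as small functions on the two sequences. This assumes the positional reading of the resource assignment sequence (entry i is the processor of the task at position i of the schedule order); the complement-graph edges shown are hypothetical, and the function names are illustrative.

```python
# Hypothetical complement graph: T1-T2 and T2-T3 are unordered pairs.
COMPLEMENT = {frozenset((1, 2)), frozenset((2, 3))}


def move_m1(order, assign, i, complement=COMPLEMENT):
    """M1: swap positions i and i+1 in both sequences if allowed."""
    if frozenset((order[i], order[i + 1])) not in complement:
        return None  # the swap would violate the task graph
    new_order, new_assign = list(order), list(assign)
    new_order[i], new_order[i + 1] = new_order[i + 1], new_order[i]
    new_assign[i], new_assign[i + 1] = new_assign[i + 1], new_assign[i]
    return new_order, new_assign


def move_m2(order, assign, i):
    """M2: swap positions i and i+1 in the resource assignment only."""
    new_assign = list(assign)
    new_assign[i], new_assign[i + 1] = new_assign[i + 1], new_assign[i]
    return list(order), new_assign


def move_m3(order, assign, i, new_proc):
    """M3: assign the task at position i to another processor."""
    new_assign = list(assign)
    new_assign[i] = new_proc
    return list(order), new_assign


order = [0, 1, 3, 2, 4]
assign = ["P2", "P2", "P2", "P1", "P1"]
after_m1 = move_m1(order, assign, 2)
after_m2 = move_m2(order, assign, 2)
after_m3 = move_m3(order, assign, 0, "P1")
```

Under this reading, M1 changes only the execution order (each task keeps its processor), M2 exchanges the processors of two neighbouring tasks, and M3 lets a task move to any processor, so together they can reach every combination of valid order and assignment.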
Main Approach – Simulated Annealing. Three moves are defined so that, starting from a valid schedule order A, we are able to reach any other valid schedule order B after a finite number of adjacent swaps. Cost function: [expression not preserved]. The first term, which carries a significantly large weight, guarantees that a schedule meets all tasks’ deadlines. The second term indicates the system lifetime.
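The slide's cost expression is not preserved, so the following is only one plausible shape consistent with its description: a heavily weighted deadline-violation term plus a term that rewards a longer system lifetime. The penalty weight, the function names, and the standard Metropolis acceptance rule are all assumptions.

```python
import math
import random

PENALTY = 1e6  # hypothetical "significantly large" weight


def cost(makespan, deadline, mttf):
    """Lower is better: penalize deadline misses, reward long MTTF."""
    violation = max(0.0, makespan - deadline)
    return PENALTY * violation - mttf


def accept(delta_cost, temperature, rng=random):
    """Metropolis rule: always accept improvements, and accept a worse
    solution with probability exp(-delta_cost / temperature)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


feasible = cost(makespan=10, deadline=12, mttf=5.0)
infeasible = cost(makespan=13, deadline=12, mttf=9.0)
```

With a large enough penalty, any schedule that misses the deadline costs more than any feasible one, so the search only trades lifetime among feasible schedules once it has found one.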
Main Approach – Simulated Annealing. Key problem: computation time. Sources of time overhead: the temperature simulator runs EVERY TIME we reach a new solution (the simulator is called 3×10^5 times); every time, it traces the temperature variation for the entire service life, which is in the range of years; accurate calculation requires a fine-grained variation trace file, because the temperature can vary significantly within a very short time. An efficient cost computation strategy is essential! SA parameters: initial temperature 10^2; end temperature 10^-5; cooling rate 0.95; iterations 10^3.
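The figure of roughly 3×10^5 simulator calls can be cross-checked from the listed SA parameters. The sketch assumes a geometric cooling schedule in which the temperature is multiplied by the cooling rate after each batch of iterations, with one cost evaluation per iteration; the slide does not spell this out.

```python
def count_evaluations(t_init=1e2, t_end=1e-5, cooling=0.95, iterations=1000):
    """Count cost evaluations for a geometric cooling schedule."""
    temperature_steps = 0
    temperature = t_init
    while temperature > t_end:
        temperature_steps += 1
        temperature *= cooling
    return temperature_steps * iterations


evaluations = count_evaluations()
print(evaluations)
```

Under these assumptions the count comes out at a little over 3×10^5, in line with the number quoted on the slide, which is why each cost evaluation has to be cheap.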
Revisit System Lifetime Reliability Estimation – Speedup I. It would be better if we could compute the MTTF by tracing the temperature variation of only one period.
Revisit System Lifetime Reliability Estimation – Speedup I. A subdivision of time ……
Revisit System Lifetime Reliability Estimation – Speedup I. Given [expression not preserved], define the aging effect in one period. Property: the aging effect in one period does not vary from period to period. This property enables us to trace the temperature variation of only ONE period.
Revisit System Lifetime Reliability Estimation – Speedup I. The expected service life of one processor is [expression not preserved]. Provided there are no redundant processors in the system, the expected service life of the entire system is [expression not preserved].
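The slide's expressions are not preserved, so what follows is a sketch under explicit assumptions rather than the authors' formulas. Assume each processor follows a Weibull law with slope `beta`, and that its aging accumulates by the same amount `A` in every period of length `period`, so that R(t) ≈ exp(-(A·t/period)^β). Then the processor behaves like a Weibull unit with scale period/A, and a series system (no redundancy, all processors sharing the same `beta`) is again Weibull.

```python
import math


def processor_mttf(aging_per_period, period, beta):
    """MTTF of one processor: (period / A) * Gamma(1 + 1/beta)."""
    scale = period / aging_per_period
    return scale * math.gamma(1.0 + 1.0 / beta)


def system_mttf(aging_list, period, beta):
    """Series system: R_sys(t) = exp(-(t/period)^beta * sum_j A_j^beta),
    which is Weibull with scale period / (sum_j A_j^beta)^(1/beta)."""
    combined = sum(a ** beta for a in aging_list) ** (1.0 / beta)
    return period * math.gamma(1.0 + 1.0 / beta) / combined


# Hypothetical aging effects per period for a two-processor platform.
aging = [1.0e-3, 2.5e-4]
per_core = [processor_mttf(a, period=1.0, beta=2.0) for a in aging]
whole_system = system_mttf(aging, period=1.0, beta=2.0)
```

In this model the system MTTF is never larger than that of the fastest-aging processor, matching the motivation that one hot core can dominate the platform lifetime.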
Revisit System Lifetime Reliability Estimation – Speedup II. Given [expression not preserved]: instead of computing the aging effect in every period, we propose to compute the aging effect of several periods at one time.
Revisit System Lifetime Reliability Estimation – Speedup III. Accurate calculation requires setting the length of the time intervals to a very small value. Instead, use the steady temperature rather than the accurate temporal temperature. [Figure: temperature variation example alongside the task schedule]
Revisit System Lifetime Reliability Estimation – Speedup IV. Without further optimization, we need to run the temperature simulator every time we reach a new solution. However, there can be at most [expression not preserved] kinds of processor usage combinations in task schedules. Given 3 task power consumption types and 4 processors, only 255 pre-computations are needed, each for a steady temperature. We therefore estimate the processors’ temperatures for the various processor usage combinations in the pre-calculation phase only.
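The figure of 255 is consistent with the following count, offered as an interpretation rather than the slide's own formula: each of the 4 processors is either idle or running a task of one of 3 power types, and the all-idle combination is excluded, giving (3 + 1)^4 − 1 combinations. The function name and the encoding (0 for idle) are illustrative.

```python
from itertools import product


def usage_combinations(num_processors, num_task_types):
    """Enumerate per-processor states (0 = idle, 1..k = task power type),
    excluding the combination in which every processor is idle."""
    states = range(num_task_types + 1)
    return [combo for combo in product(states, repeat=num_processors)
            if any(combo)]


combos = usage_combinations(num_processors=4, num_task_types=3)
print(len(combos))
```

Each enumerated combination would be simulated once to obtain steady temperatures, and the results reused for every candidate schedule visited by the search.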
Revisit System Lifetime Reliability Estimation – Speedup IV. A time slot is described by the set of processors in use and the power consumption of the tasks running on these processors. Categorize the tasks into types according to power consumption. E.g., [example not preserved], giving the indices of the processors in use and the task power consumption types.
Revisit System Lifetime Reliability Estimation – Speedup IV. Pre-calculate the steady temperature of a processor in a time slot. The aging effect in unit time in this case is therefore [expression not preserved]. The aging effect of P1 in this schedule in a period is [expression not preserved].
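A minimal sketch of this lookup-based computation follows. It assumes that the aging effect per unit time at temperature T is 1/α(T), with α(T) a temperature-dependent Weibull scale parameter modeled here by a hypothetical Arrhenius-style function, and that the aging effect in a period is the sum over time slots of slot length times aging rate. The temperature table, slot list, and all names are made up for illustration.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def scale_parameter(temp_kelvin, prefactor=1.0, activation_ev=0.9):
    """Hypothetical temperature-dependent scale parameter alpha(T)."""
    return prefactor * math.exp(activation_ev / (BOLTZMANN_EV * temp_kelvin))


# Pre-calculated steady temperatures (K) per usage combination.
# Key: task power type per processor (0 = idle); value: temperature per processor.
STEADY_TEMP = {
    (1, 0): (345.0, 330.0),
    (1, 2): (350.0, 360.0),
    (0, 2): (335.0, 355.0),
}


def aging_in_period(slots, proc):
    """Sum of slot_length / alpha(T) for one processor over one period."""
    return sum(duration / scale_parameter(STEADY_TEMP[usage][proc])
               for duration, usage in slots)


# One period split into time slots: (length, usage combination).
slots = [(2.0, (1, 0)), (3.0, (1, 2)), (1.0, (0, 2))]
aging_p1 = aging_in_period(slots, proc=0)
aging_p2 = aging_in_period(slots, proc=1)
```

No temperature simulation happens inside `aging_in_period`; the cost of a candidate schedule reduces to table lookups and a short sum, which is the point of this speedup.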
Revisit System Lifetime Reliability Estimation – Summary. A summary of the speedup techniques: (I) rewrite the MTTF expression in terms of the aging effect in one period; (II) compute the aging effect of several periods at one time; (III) approximate the aging effect in one period based on the task changes and using the steady temperature; (IV) call the temperature estimation simulator in the pre-calculation phase only. The time spent on pre-calculation can be reduced even further.
Experimental Setup. Random task graphs are generated by TGFF, with task numbers ranging from 20 to 260. Hypothetical MPSoC platforms: processor core numbers range from 2 to 8; homogeneous and heterogeneous. The electromigration model in [Goel-IEEEPress07] is taken as an example; note that our model also applies to other failure mechanisms. We compare our method with the thermal-aware task scheduling algorithm proposed in [Xie-JVLSISP06].
Accuracy. [Figure: comparison between the approximated MTTF and the accurate value]
Lifetime Reliability of Various Platforms with Various Task Graphs. Table columns: platform description (M-PE, Co-PE); task description (Task, Edge); deadline; thermal-aware (MTTF); simulated annealing at 0% DR, 5% DR and 10% DR (MTTF and Δ(%) for each). Δ: difference ratio between the MTTF of simulated annealing and that of thermal-aware scheduling. DR: deadline relaxation. [Table values not preserved]
Lifetime Reliability of 8-Processor Platforms. Table columns, for both the 8-core homogeneous platform and the 8-core heterogeneous platform: thermal-aware (deadline, MTTF); simulated annealing (DR (%), MTTF, Δ(%)). Task #: 101, Edge #: 142: deadline 1059 (homogeneous), 809 (heterogeneous). Task #: 131, Edge #: 190: deadline 1227 (homogeneous), 984 (heterogeneous). Task #: 251, Edge #: 366: deadline 2014 (homogeneous), 1693 (heterogeneous). [MTTF, DR and Δ values not preserved]
Efficiency. The simulated annealing process requires [value not preserved] s of CPU time on an Intel(R) Core(TM) 2 CPU at 2.13 GHz for each case. 4 processors, 49 tasks: 84 s. 8 processors, 101 tasks: 158 s. The CPU time spent on pre-calculation ranges from 3 s to 160 s.
Conclusion. Technology advancement has brought an adverse impact on the lifetime reliability of MPSoC embedded systems. Prior work on task allocation and scheduling does not explicitly take wear-out failure into account. We propose an analytical model to estimate the lifetime reliability of multiprocessor platforms under periodical tasks. We present a novel lifetime reliability-aware algorithm based on the simulated annealing technique. We propose several speedup techniques to simplify the design space exploration process with satisfactory solution quality. Experimental results demonstrate the effectiveness of the proposed approach.