
06 December 2007 FPGA Self-Repair using an Organic Embedded System Architecture Kening Zhang, Jaafar Alghazo and Ronald F. DeMara University of Central Florida

Reconfigurable Hardware with Self-Healing Based on an SRAM FPGA Platform
Organic Computing (OC): biologically inspired computing with “self-x” properties.
- System property: composed of a large collection of autonomous systems linked by communication networks; each autonomous system owns its sensors and actuators.
- Self-x characteristics: self-organization, self-configuration, self-optimization, self-healing, self-protection, self-explaining, context-awareness, and self-synchronization.
Technical objective: the OC approach addresses system controllability as complexity increases.
Example relevance: how can a sustainable presence be achieved for NASA's Moon, Mars & Beyond objective? Reliability, availability, and sustainability are required to support long-lifetime missions with multiple failure occurrences.
Research focus and sponsors:
- NASA: FPGA platform and Genetic Algorithm research
- DARPA: OC approach and SOAR Longevity Platform

Goal: Autonomous FPGA Refurbishment
Redundancy vs. refurbishment:
- Overhead from unutilized spares (weight, size, power): for redundancy it increases with the amount of spare capacity; for refurbishment it is only weakly related to the number of spares.
- Granularity of fault coverage (resolution at which a fault is handled): restricted at design time for redundancy; recovery capacity is variable at recovery time for refurbishment.
- Fault-resolution latency (availability, via the downtime required to handle a fault): based on the time required to select a spare resource for redundancy; based on the time required to find a suitable recovery for refurbishment.
- Quality of repair (likelihood and completeness): determined by the adequacy of the spares available (?) for redundancy; affected by multiple characteristics (+ or -) for refurbishment.
- Autonomous operation (fix without outside intervention): yes for both.
Objective: increase availability without carrying pre-configured spares …

Fault-Handling Techniques for SRAM-based FPGAs
Characteristics compared for each method: device failure duration, target, detection, isolation, diagnosis, and recovery.
Transient failures (SEU):
- Scrubbing: targets the device configuration; detection by bitwise comparison; recovery by reloading the bitstream or inverting the bit value.
- TMR: targets the processing datapath; detection by majority vote; recovery by ignoring the discrepancy.
Permanent failures (SEL, oxide breakdown, electromigration, LPD):
- BIST (STARS): targets the device configuration; detection by a supplementary testbench; isolation by Cartesian intersection; recovery by worst-case clock period dilation and replication in a spare resource.
- CED: targets the processing datapath; detection by duplex output comparison; isolation by fast run-time location; recovery by selecting a spare resource.
- Vigander: detection by duplex/triplex output comparison; isolation not addressed; diagnosis unnecessary; evolutionary recovery via a population-based GA using extrinsic fitness evaluation.
- OC (this work): handled by the Autonomous Supervisor (AS) and Autonomous Elements (AEs); evolutionary recovery via an evolutionary algorithm using intrinsic fitness evaluation.

Autonomous System-on-a-Chip (ASoC) Architecture
Dual-layer ASoC proposed by Lipsa et al. [Lipsa 05]:
- Functional layer: Functional Elements (FEs), e.g. CPU, RAM, network interface.
- Autonomic layer: Autonomic Elements (AEs), each with a monitor, an actuator, and a communication interface, plus an Autonomic Supervisor (AS).
UCF approach: fault coverage of both the functional layer and the autonomic layer is achieved by assessing consensus among elements:
1. Consensus is first used to realize failure detection.
2. Consensus then provides an organic method for evaluating the fitness of competing alternatives during evolution, giving a self-regulating approach to fault resolution.

EHW Environments
Evolvable Hardware (EHW) environments enable experimental research on soft-computing intelligent search techniques. EHW operates by repeatedly reprogramming real-world physical devices in an iterative refinement process driven by a Genetic Algorithm, with either the hardware or a simulation in the loop.
Two modes of evolvable hardware:
- Extrinsic evolution: the GA refines a software model with simulation in the loop; when done, the design is built on the device ("design-time" refinement).
- Intrinsic evolution: the GA refines the device itself with hardware in the loop ("run-time" refinement), a new approach to autonomous repair of failed devices.
Application: a deep-space satellite may carry more than 100 FPGAs onboard in a hostile environment (radiation, thermal stress). How can reliability be achieved to avoid mission failure?

Genetic Algorithms (GAs)
Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics).
Flow: start → population of candidate solutions → evaluate fitness of individuals (fitness function) → if the goal is reached, stop; otherwise select parents → crossover and mutation produce offspring → replacement → evaluate fitness again.
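The loop above can be sketched in Python as a toy GA on bit strings. This is a minimal illustration under stated assumptions, not the authors' implementation: the OneMax fitness (count of 1s), tournament selection, one-point crossover, and all parameter values are hypothetical choices, and the `elite` count plays the role of the Elitism operator E.

```python
import random
from typing import List

def onemax(bits: List[int]) -> int:
    # Toy fitness (assumption): number of 1s in the string
    return sum(bits)

def tournament(pop, rng, k=2):
    # Parent selection: the fitter of k randomly sampled individuals
    return max(rng.sample(pop, k), key=onemax)

def crossover(a, b, rng):
    # One-point crossover
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rate, rng):
    # Flip each bit independently with probability `rate`
    return [1 - b if rng.random() < rate else b for b in bits]

def run_ga(n_bits=16, pop_size=20, generations=40, rate=0.05, elite=1, seed=0):
    """Return the best fitness seen in each generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=onemax, reverse=True)          # evaluate fitness
        history.append(onemax(pop[0]))
        if history[-1] == n_bits:                   # goal reached
            break
        nxt = [ind[:] for ind in pop[:elite]]       # elitism: keep the E best unchanged
        while len(nxt) < pop_size:                  # replacement with offspring
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            nxt.append(mutate(child, rate, rng))
        pop = nxt
    return history
```

With `elite >= 1`, the best fitness recorded per generation can never decrease, because the best individual is always copied unchanged into the next generation.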

Genetic Mechanisms
- Guided trial-and-error search techniques using principles of Darwinian evolution: iterative selection ("survival of the fittest"); genetic operators such as mutation and crossover; the implementor must define the fitness function.
- GAs frequently use strings of 1s and 0s (genotype chromosomes) to represent candidate solutions. GA operation: if one candidate is better than another, it has a greater chance to breed and influence the future population.
- Genotype changes during evolution must adhere to the Xilinx-defined bitstream format. To prevent undesirable conditions that may damage the FPGA, such as a mutation that ties two logic outputs together, a logical genotype is used for evolution and mapped to the physical phenotype:
  Logic # = functional logic index number for the LUT
  Row/Column = physical location of the LUT in the FPGA
- An Elitism operator can be invoked (E = 1, E = 2, …); it guarantees that the fitness of the best individual never decreases over all generations.
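A minimal sketch of the logical-genotype to physical-phenotype mapping, assuming a hypothetical placement table that gives each logic number a (Row, Column) LUT site. The check rejects any genotype that would place two logic functions on the same LUT, one way to rule out the tied-output condition described above; the names `to_phenotype` and `placement` are illustrative, not from the source.

```python
from typing import Dict, List, Tuple

Location = Tuple[int, int]  # (row, column) of a LUT; hypothetical addressing

def to_phenotype(genotype: List[int], placement: Dict[int, Location]) -> List[Tuple[int, int, int]]:
    """Map a logical genotype (ordered logic numbers) onto physical LUT sites.

    Raises ValueError if two logic functions land on the same LUT, i.e. two
    outputs would drive one physical resource.
    """
    used: Dict[Location, int] = {}
    phenotype = []
    for logic in genotype:
        row, col = placement[logic]
        if (row, col) in used:
            raise ValueError(f"LUT {(row, col)} already hosts logic {used[(row, col)]}")
        used[(row, col)] = logic
        phenotype.append((logic, row, col))
    return phenotype
```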

Loosely Coupled Solution on Xilinx Virtex-II Pro & Virtex-4
The entire system operates on a 32-bit basis. The Virtex-II Pro/Virtex-4 is mounted on a development board, which interfaces with a workstation running Xilinx EDK and ISE.

Organic Embedded System (OES) Architecture
One-dimensional, column-oriented OES based on the Xilinx Virtex-II Pro FPGA platform:
- FEs and AEs reside on two distinct layers with an interconnection structure between them.
- AEs and FEs can be realized in hardware, software, or hardware/software co-design.
- The AE layer supervises the functionality of the FEs while requiring no application-specific algorithms on the AE layer.
- The Observer/Controller architecture includes an AS element that has no counterpart to check whether the AS itself is fault-free; the proposed approach addresses this by minimizing the complexity of the AS.
- Xilinx partial reconfiguration technology is used to manipulate relocatable bitstreams.

OES AE Component Design
AEs decentralize the Observer/Controller functionality. Each AE contains:
- a Concurrent Error Detection (CED) unit that collects two FE outputs to identify discrepancies;
- a Checksum unit for AE fault detection, checked against stored checksum values;
- an Evaluator that assesses the outputs of the two FEs against the checksum; and
- an Actuator that initiates the recovery phase.
An important architectural property is that all AE components are structurally identical even though they monitor different types of FEs. This homogeneity yields uniform behavior, which the consensus-based evaluation fault-handling methodology leverages.
OC concept: although AE components add complexity to the design, they ease the fault-handling integration difficulties inherent in current commercial IP cores.
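A minimal sketch of the AE checks, assuming FE outputs are integers and the AE configuration is available as bytes. CRC-32 via Python's `zlib.crc32` stands in for the checksum, which the slides do not specify, and the action labels returned by the hypothetical `evaluate` function are illustrative only.

```python
import zlib

def ced_discrepancy(out_a: int, out_b: int) -> bool:
    # Concurrent Error Detection: flag any disagreement between the two FE outputs
    return out_a != out_b

def ae_self_check(ae_config: bytes, stored_checksum: int) -> bool:
    # CRC-32 is an illustrative stand-in for the AE checksum
    return zlib.crc32(ae_config) == stored_checksum

def evaluate(out_a: int, out_b: int, ae_config: bytes, stored_checksum: int) -> str:
    """Return the action an Evaluator/Actuator pair might take (hypothetical labels)."""
    if not ae_self_check(ae_config, stored_checksum):
        return "repair_ae"
    if ced_discrepancy(out_a, out_b):
        return "isolate_fe"   # e.g. wake the cold-standby FE and form a temporary TMR
    return "ok"
```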

Consensus-Based Evaluation (CBE)
Uses a relative fitness measure:
- Pairwise discrepancy checking yields a relative fitness measure.
- Broad temporal consensus in the population is used to determine the fitness metric.
- Transitions between fitness states occur within the population.
- Because this is a moving measure, it provides graceful degradation in the presence of changing environments, applications, and inputs.
Test inputs = normal inputs for data throughput:
- CBE uses neither additional functional nor resource test vectors.
- Availability is potentially higher, since regeneration is integrated with normal operation.
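The relative fitness idea can be sketched as follows, assuming each configuration is modeled as a Python function from an input word to an output word. Every individual is scored by how often it agrees with its peers on the ordinary input stream; the scoring rule (fraction of agreeing pairwise comparisons) is an assumption for illustration, not the exact CBE metric.

```python
from itertools import combinations
from typing import Callable, List, Sequence

def consensus_fitness(population: List[Callable[[int], int]], inputs: Sequence[int]) -> List[float]:
    """Relative fitness from pairwise discrepancy checking on normal throughput inputs.

    Each individual's score is the fraction of its pairwise comparisons (over all
    inputs) in which its output matched the other individual's output.
    """
    agree = [0] * len(population)
    total = [0] * len(population)
    for x in inputs:
        outs = [f(x) for f in population]
        for i, j in combinations(range(len(population)), 2):
            match = outs[i] == outs[j]
            for k in (i, j):
                total[k] += 1
                agree[k] += match
    return [a / t for a, t in zip(agree, total)]
```

For example, three correct 2-bit adders and one adder whose output bit 0 is stuck at zero, fed the 16 possible inputs, give the faulty individual the lowest score without any dedicated test vectors.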

Genetic Operators: Mutation
(Figure panels: mutation on genotype chromosomes and on phenotype chromosomes.)
- The original functionality is F = F1·(F3 + F4), with input F2 left unassigned by the synthesis tool.
- The mutation operator makes input F4 unused, giving F = F1·(F3 + F2); shading in the figure shows the changed input and LUT contents.
- This gives some opportunity to work around an input stuck-at fault or a LUT-content stuck-at fault, while the functionality of the LUTs remains undistorted as the search space is explored.
- Typical approach: bit inversion of LUT functionality. Selected approach: mutate the input interconnection of LUTs, rearranging it to search for unused LUT resources that occlude the faulty resource.
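To make the input-remapping mutation concrete, the sketch below models a 4-input LUT as a 16-entry truth table and moves the signal on pin F4 to the previously unused pin F2, rebuilding the LUT contents so the logic is unchanged. The pin encoding, the signal names `a`, `b`, `c`, and the helper functions are assumptions for illustration.

```python
from itertools import product
from typing import Callable, Dict, Optional, Tuple

PINS = ("F1", "F2", "F3", "F4")

def build_lut(func: Callable[..., int], wiring: Dict[str, str]) -> Tuple[int, ...]:
    """Fill a 16-entry truth table so that, through `wiring` (pin -> signal), the LUT computes func."""
    table = []
    for index in range(16):
        pins = {p: (index >> k) & 1 for k, p in enumerate(PINS)}
        signals = {sig: pins[p] for p, sig in wiring.items()}
        table.append(func(**signals))
    return tuple(table)

def read_lut(table, wiring, signals, stuck: Optional[Tuple[str, int]] = None) -> int:
    """Drive pins from signals (unused pins at 0), optionally forcing one pin stuck-at-0/1."""
    pins = {p: 0 for p in PINS}
    for p, sig in wiring.items():
        pins[p] = signals[sig]
    if stuck is not None:
        pins[stuck[0]] = stuck[1]
    return table[sum(pins[p] << k for k, p in enumerate(PINS))]

def f(a, b, c):
    # Original functionality F = F1·(F3 + F4), written over logical signals a, b, c
    return a & (b | c)

original = {"F1": "a", "F3": "b", "F4": "c"}   # F2 left unassigned by synthesis
mutated = {"F1": "a", "F3": "b", "F2": "c"}    # mutation moves signal c from F4 to F2

def survives(wiring, stuck):
    """True if the LUT still computes f for every input despite the stuck pin."""
    table = build_lut(f, wiring)
    return all(
        read_lut(table, wiring, dict(a=a, b=b, c=c), stuck) == f(a, b, c)
        for a, b, c in product((0, 1), repeat=3)
    )
```

With a stuck-at-0 fault on pin F4, the original wiring produces a wrong output (for example a=1, b=0, c=1), whereas the mutated wiring no longer depends on F4.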

Genetic Operators: Cell Swapping
(Figure panels: cell-swap operation on genotype chromosomes and on phenotype chromosomes.)
Cell swapping interchanges two distinct LUT blocks while maintaining correct logic order and functionality in the genotype: all LUT input interconnections, the LUT contents, and the physical 2-tuple (Col#, Row#) are exchanged, along with the logic sequence.
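One reading of the cell-swap operator, sketched under assumptions: each gene records a logic number, LUT contents, input interconnections, and a physical (Col#, Row#) site, and the swap exchanges the sites of two genes so that each function moves onto the other's LUT while the logic order of the genotype is unchanged. `LutGene` and `cell_swap` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class LutGene:
    logic: int                 # functional logic index number
    content: Tuple[int, ...]   # LUT truth table
    inputs: Tuple[str, ...]    # input interconnections (hypothetical signal names)
    site: Tuple[int, int]      # physical (Col#, Row#)

def cell_swap(genotype: List[LutGene], i: int, j: int) -> List[LutGene]:
    """Move logic i onto logic j's LUT site and vice versa; logic order is unchanged."""
    child = list(genotype)
    child[i] = replace(genotype[i], site=genotype[j].site)
    child[j] = replace(genotype[j], site=genotype[i].site)
    return child
```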

Genetic Operators: PMX Operator
Partially Matched Crossover (PMX) preserves crossover information as well as order information:
- The two genotype configuration streams are aligned at LUT boundaries.
- A crossover site is selected at random along a LUT boundary; this crossover point defines a left/right partition used to effect crossover through LUT-by-LUT exchange.
- Suppose the crossover point is at position 4 of the LUT vector. The first step maps configuration B onto configuration A by exchanging the aligned LUTs {(4,7), (5,2), (6,1), (7,5)}. Applying PMX yields two new configurations, A' and B'.
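The PMX step can be sketched for permutation genotypes. Positions here are 0-based, the segment `[start, end)` plays the role of the region right of the crossover point, and configuration `B` is hypothetical, chosen so that its aligned tail reproduces the pairs {(4,7), (5,2), (6,1), (7,5)}. This is textbook PMX; how the authors apply it to full LUT records is not shown here.

```python
from typing import List, Sequence

def pmx(p1: Sequence[int], p2: Sequence[int], start: int, end: int) -> List[int]:
    """Partially Matched Crossover for permutation genotypes (one child).

    The child inherits p2[start:end] in place; every other position takes p1's LUT,
    remapped through the segment's pairings whenever that LUT already appears in the segment.
    """
    child = list(p1)
    child[start:end] = p2[start:end]
    mapping = {p2[k]: p1[k] for k in range(start, end)}
    for i in list(range(start)) + list(range(end, len(p1))):
        gene = p1[i]
        while gene in mapping:      # follow the pairings until the LUT is not duplicated
            gene = mapping[gene]
        child[i] = gene
    return child

A = [0, 1, 2, 3, 4, 5, 6, 7]
B = [3, 6, 4, 0, 7, 2, 1, 5]   # hypothetical; its tail 7,2,1,5 pairs with A's 4,5,6,7
A_prime = pmx(A, B, 4, 8)
B_prime = pmx(B, A, 4, 8)
```

Here `A_prime` is [0, 6, 4, 3, 7, 2, 1, 5] and `B_prime` is [3, 1, 2, 0, 4, 5, 6, 7]; both remain valid permutations, so no LUT is duplicated or lost.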

Illustrative Example: Gate-Level Design of OES
- Experiment circuit: 1-bit full adder
- Fault-free model: duplex
- Fault-impact model: TMR
- Fault-detect model: CBE
- Fault-recovery strategy: GA operation
Experimental setup:
- Hardware prototype implemented in a Xilinx Virtex-II Pro FPGA
- VHDL implementation
- Uses the GNAT library along with the MRRA framework and the JTAG reconfiguration interface

MCNC-91 Benchmark Case Studies: System Availability under Multiple Faults

Circuit   Function    Inputs  Outputs  Approx. gates
z4ml      2-bit add   7       4        20
cm85a     Logic       11      3        38
cm138a    Logic       6       8        17

Terms used in the availability expression:
- Fc = number of correct behaviors of the FE observed during the evolutionary recovery phase
- Fe = number of errant (discrepant) behaviors
- 1 = exactly one output is required to detect the fault in the original CED configuration
- 2 = number of reconfigurations required: one from CED to TMR and one from TMR back to CED
- Fc1, Fe1 = number of correct and faulty outputs of the FE during the AE repair period
- Fc2, Fe2 = number of correct and faulty outputs during the FE repair period
- n = number of reconfigurations of the FE
- β = ratio of reconfiguration time to computation time

Experimental Results
- Redundancy for both the FE (R_FE) and the AE (R_AE) = ratio of unused LUT inputs to the total number of LUT inputs
- Fc = number of correct behaviors of the FE observed during the evolutionary recovery phase; Fe = number of errant (discrepant) behaviors
- n = number of reconfigurations of the FE; β = ratio of reconfiguration time to computation time
- Fault-free arrangement: CED FEs with a cold-standby FE
- Inject a stuck-at-zero or stuck-at-one fault at one of the FE's LUT input pins
- Switch from CED to TMR to identify the faulty FE or AE
- CBE is used to resolve a faulty AE
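The redundancy ratio is simple arithmetic; a sketch assuming 4-input LUTs (the LUT width of the Virtex-II Pro) and a hypothetical list giving the number of used inputs per LUT:

```python
from typing import Sequence

def redundancy(used_inputs: Sequence[int], lut_width: int = 4) -> float:
    """R = unused LUT inputs / total LUT inputs, for LUTs with `lut_width` inputs each."""
    total = lut_width * len(used_inputs)
    unused = sum(lut_width - u for u in used_inputs)
    return unused / total
```

For three LUTs using 2, 3, and 4 inputs, 3 of 12 inputs are unused, so R = 0.25.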

Conclusion
- A self-adaptive, self-healing OES architecture was developed for autonomic operation without human intervention.
- The OES architecture can handle many single-fault scenarios and several multiple-fault scenarios for small digital logic designs.
- Experimental results support the design objectives: during the repair phase, availability averaged 75.05%, 82.21%, and 65.21% for the z4ml, cm85a, and cm138a circuits, respectively, under the stated conditions.
- The reconfiguration-to-computation time ratio (β) is the key factor limiting availability during AE repair.
- Future work: evaluate extensions of the OES architecture that address scalability in terms of pipelined stages.

Backup Slides On following pages …

Isolation of a Single Faulty Individual with 1-out-of-64 Impact
(Figure: instantaneous DV point values for a sample individual in the population, population oracles as solid lines, and the sliding window.)
- Outliers are identified after $E_W$ iterations have elapsed.
- Expected DV = (1/64) × 600 = 9.375 for the individual impacted by the fault.
- The isolated faulty individual's DV differs from the average DV by $3\sigma$ after one or more observation intervals of length $E_W$.
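The expected discrepancy value and a $3\sigma$ outlier rule can be sketched as below. The population DV values in the usage example are synthetic, and using the population standard deviation (`statistics.pstdev`) for σ is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence

W = 600                  # evaluation window E_W from the slide
expected_dv = W / 64     # one of 64 input combinations exposes the fault

def outliers(dv: Sequence[float], k: float = 3.0) -> List[int]:
    """Indices whose discrepancy value differs from the population mean by more than k·σ."""
    mu, sigma = mean(dv), pstdev(dv)
    return [i for i, d in enumerate(dv) if sigma > 0 and abs(d - mu) > k * sigma]
```

For example, `outliers([1] * 19 + [10])` flags only index 19.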

Future Work: From Development Board to Self-Contained FPGA
- Qualitative analysis of the CRR model: number of iterations and completeness of regeneration repair; percentage of time the device remains online despite a physical resource fault (availability)
- Hardware resource management: optimization of the hardware profile for the Xilinx Virtex-II Pro
- Field testing on an SRAM-based FPGA in a CubeSat mission

OES Integrated FE and AE Failure Detection Procedure
1. System initialization: FE initialization step; compute-checksum step.
2. FE fault detection/recovery: AE-CED fault detection; FE fault recovery.
3. AE fault detection phase: a fault may exist in the CED, Actuator, or Evaluator; in the Checksum component; or in the stored-checksum LUT.
Runtime inputs to the FE are applied to both active instances under a CED strategy. After allowing for the propagation time of FE inputs through the AE, the expected output is supplied to the AE-CED for fault detection. The FE outputs are then compared in the AE-CED module, and any discrepancy between the two values indicates that a fault has occurred either in one of the FEs or in the AE-CED itself; further detection is required to distinguish which of the two is faulty. If the AE component is identified as innocent, the fault must have occurred in an FE: the output is discarded and control branches to a fault-identification phase, which wakes up the cold-standby FE and constructs a temporary TMR system that can identify the faulty FE under newly supplied external inputs. Furthermore, as described in Section 3.3, the Actuator initiates a repair cycle, which may require automatic evolutionary repair of the identified faulty FE; that FE is set as standby-under-repair, and the AE-CED returns to receiving the outputs of the remaining two active FEs. This decision-making procedure incurs at least one throughput-delay penalty.
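The CED-to-TMR decision flow can be sketched as follows, assuming scalar FE outputs. The `suspect_ae` outcome for the no-majority case is a hypothetical label standing in for the further AE fault detection the procedure calls for; the function names are illustrative.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

def isolate_faulty_fe(outputs: Sequence[int]) -> Optional[int]:
    """Temporary TMR: return the index of the FE that disagrees with the majority, if any."""
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        return None          # no majority: the fault cannot be attributed to one FE
    dissent = [i for i, o in enumerate(outputs) if o != winner]
    return dissent[0] if dissent else None

def handle_cycle(fe_a: int, fe_b: int, standby: int) -> Tuple[str, Optional[int]]:
    """Duplex CED first; on discrepancy, wake the cold-standby FE and vote (hypothetical flow)."""
    if fe_a == fe_b:
        return ("ok", None)
    faulty = isolate_faulty_fe([fe_a, fe_b, standby])
    if faulty is None:
        return ("suspect_ae", None)
    return ("repair_fe", faulty)   # this FE becomes standby-under-repair
```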

Previous Work: Detection Characteristics of FPGA Fault-Handling Schemes …
Strategy #1: evolve redundancy into the design before the anticipated failure, or …

Previous Work: Fault Recovery Characteristics of Selected Approaches …
Strategy #2: evolve recovery from a specific failure after (and if) it occurs, or …

CRR Arrangement in SRAM FPGA
Configurations in the population: $C = C_L \cup C_R$, where $C_L$ is the subset of left-half configurations, $C_R$ is the subset of right-half configurations, and $|C_L| = |C_R| = |C|/2$.
Discrepancy operator: the baseline discrepancy operator $\Delta$ is a dyadic operator with binary output, $\Delta(C_i, C_j) = 0$ if $Z(C_i) = Z(C_j)$ and $1$ otherwise, where $Z(C_i)$ is the FPGA data-throughput output of configuration $C_i$.
- Each half-configuration evaluates $\Delta$ using an embedded checker (an XNOR gate) within each individual.
- Any fault in the checker lowers that individual's fitness, so the individual is no longer preferred and eventually undergoes repair.
- Variants: $\Delta$ = RS (Hamming distance); $\Delta$ = WTA (equivalence).
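The two discrepancy operators and the embedded XNOR checker can be sketched on integer-encoded outputs; the function names are hypothetical, and the output width is passed explicitly.

```python
def delta_wta(z_i: int, z_j: int) -> int:
    """Equivalence-based operator: 1 if the two half-configurations' outputs differ at all."""
    return int(z_i != z_j)

def delta_rs(z_i: int, z_j: int) -> int:
    """Hamming-distance operator: number of output bits on which the two outputs differ."""
    return bin(z_i ^ z_j).count("1")

def xnor_check(z_i: int, z_j: int, width: int) -> int:
    """Bitwise XNOR, as an embedded checker would compute it: 1-bits mark agreement."""
    return ~(z_i ^ z_j) & ((1 << width) - 1)
```

For outputs 0b1010 and 0b0011, the equivalence operator reports 1, the Hamming-distance operator reports 2, and the XNOR checker yields 0b0110.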

Terminology and Characteristics
- Pristine Pool $C_P$: for any $C_i \in C$, $C_i$ is a member of $C_P$ at generation $G$ if and only if …
- Suspect Pool $C_S$: for any $C_i \in C$, $C_i$ is a member of $C_S$ at generation $G$ if and only if at least one of …
- Under Repair Pool $C_U$: for any $C_i \in C$, $C_i$ is a member of $C_U$ at generation $G$ if and only if …
- Refurbished Pool $C_R$: after a genetic operator is applied, the newly generated individual is a member of $C_R$ at generation $G$ if and only if …
- Discrepancy and correctness counts: $E_D$ is the discrepancy count of $C_i$ and $E_C$ is its correctness count.
- Length of the evaluation fitness window: $E_W = E_D + E_C$.
- Fitness metric: $f(C_i) = E_C / E_W$.

Sketch of CRR Approach
Premise: recovery complexity << design complexity; fitness is assessed via pairwise discrepancy (temporal voting vs. spatial voting).
1. Initialization
- Population P of functionally identical yet physically distinct configurations.
- Partition P into sub-populations that use supersets of physically distinct resources, e.g. of size |P|/2, to designate utilization of the physical FPGA's left-half or right-half resources.
2. Fitness assessment
- The discrepancy operator Δ is some function of bitwise agreement between each half's output.
- Four fitness states are defined for configurations, {C_P, C_S, C_U, C_R}, respectively Pristine, Suspect, Under Repair, and Refurbished, with transitions among them.
- The fitness evaluation window W determines the comparison interval.
3. Regeneration
- Genetic operators are used to recover from the fault, based on a reintroduction rate.
- Operators are applied only once; the offspring is then returned to "service" without concern about increasing fitness.
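A small simulation of one CRR evaluation window, under assumptions: each input is processed by one randomly chosen left/right pair, the equivalence (WTA-style) operator detects discrepancies, and configurations are modeled as Python functions. Random pairing is a hypothetical policy; the slides do not specify how pairs are selected. The 3x3 multiplier model and its stuck-at-one fault are illustrative.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class HalfConfig:
    func: Callable[[int], int]  # Z(C_i): data-throughput output for an input
    e_d: int = 0                # discrepancy count E_D
    e_c: int = 0                # correctness count E_C

    def fitness(self) -> float:
        # f(C_i) = E_C / E_W, with E_W = E_D + E_C; unevaluated individuals count as 1.0
        e_w = self.e_d + self.e_c
        return self.e_c / e_w if e_w else 1.0

def crr_window(left: List[HalfConfig], right: List[HalfConfig],
               inputs: Sequence[int], rng: random.Random) -> None:
    """Run one evaluation window; each input is processed by a random left/right pair."""
    for x in inputs:
        cl, cr = rng.choice(left), rng.choice(right)
        if cl.func(x) != cr.func(x):   # equivalence-style discrepancy check
            cl.e_d += 1
            cr.e_d += 1
        else:
            cl.e_c += 1
            cr.e_c += 1

def mult3x3(x: int) -> int:
    # 6-bit input: upper 3 bits times lower 3 bits
    return (x >> 3) * (x & 7)

def mult3x3_stuck1(x: int) -> int:
    # Hypothetical fault: output bit 0 stuck at one
    return mult3x3(x) | 1
```

With four correct left halves, one faulty left half, five correct right halves, and 600 random 6-bit inputs, the faulty half should accumulate discrepancies and its fitness falls below 1, while every correct left half stays at 1.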

State Transitions during the Lifetime of the i-th Half-Configuration
(Figure: configuration health states.)

Procedural Flow under Competitive Runtime Reconfiguration
Integrates all fault-handling stages using an EC strategy:
- Detects faults by the occurrence of discrepancies.
- Isolates faults by the accumulation of discrepancies.
- Performs failure-specific refurbishment using genetic operators: Intra-Module-Crossover, Inter-Module-Crossover, and Intra-Module-Mutation.
Realizes online device refurbishment:
- Refurbished online without additional functional or resource test vectors.
- Repair occurs during the normal data-throughput process.

Fitness Evaluation Window
Fitness evaluation window $W$: the number of iterations used to evaluate fitness before the state of an individual is determined.
Determination of $W$ for a 3x3 multiplier:
- 6 input pins articulate $2^6 = 64$ possible inputs.
- $W$ should be selected so that all possible inputs appear.
- More formally, let $\mathrm{rand}(X)$ return some $x_i \in X$ at random; seek $W$ such that $\bigcup_{i=1}^{W} \{\mathrm{rand}(X)\} = X$ with high probability.
- Let $x_K$ be the number of distinct orderings in which all $K$ inputs show in $D$ trials. If $D$ is constant, $P_K$ can be calculated successively for $K > 1$: the probability that all $K$ inputs show after $D$ trials is $P_K = x_K / K^D$.
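The successive calculation of $P_K$ can be sketched with exact integer arithmetic. Here `x[k]` counts length-D sequences that use every one of k given inputs, obtained by subtracting from $k^D$ the sequences confined to a proper subset of those inputs; `smallest_window` is a hypothetical helper that searches for the smallest W reaching a target probability.

```python
from math import comb

def p_all_inputs_seen(K: int, D: int) -> float:
    """Probability that D uniformly random draws from K inputs cover all K of them."""
    x = [0] * (K + 1)
    for k in range(1, K + 1):
        # sequences over k inputs, minus those using exactly j < k of them
        x[k] = k ** D - sum(comb(k, j) * x[j] for j in range(1, k))
    return x[K] / K ** D

def smallest_window(K: int, target: float = 0.99, step: int = 1) -> int:
    """Smallest W (searched in increments of `step`) with P_K >= target."""
    W = K
    while p_all_inputs_seen(K, W) < target:
        W += step
    return W
```

For K = 2, `smallest_window(2)` returns 8, since $1 - 2^{1-D}$ first reaches 0.99 at D = 8; `p_all_inputs_seen(64, 600)` evaluates the slide's setting of 64 inputs and a 600-iteration window.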

When K=64: W Determination

Integer Multiplier Case Study
Design: 3-bit x 3-bit unsigned multiplier, automated design:
- Building blocks: Half-Adder (18 templates created), Full-Adder (24 templates), Parallel-AND (1 template created)
- Templates are randomly selected for instantiation in modules.
GA operators: External-Module-Crossover, Internal-Module-Crossover, Internal-Module-Mutation
GA parameters: population size of 20 individuals; crossover rate of 5%; mutation rate of up to 80% per bit
Experimental evaluation: Xilinx Virtex-II Pro on an Avnet PCI board
Experiments demonstrate …
- The objective fitness function is replaced by the Consensus-based Evaluation approach and relative fitness
- Elimination of additional test vectors
- Temporal assessment process

Template Fault Coverage
(Figure: Half-Adder Template A and Half-Adder Template B.)
- Template A: Gate3 is an AND gate; it loses correctness if a stuck-at-zero fault occurs on the second input line of Gate3.
- Template B: Gate3 is a NOT gate and uses only the first input line; it works correctly even if the second input line is stuck at zero or one.

Regeneration Performance Difference (vs. Hamming Distance)
Parameters:
- Evaluation window: $E_W = 600$
- Suspect threshold: $\theta_S = 1 - 6/600 = 99\%$
- Repair threshold: $\theta_R = 1 - 4/600 \approx 99.3\%$
- Re-introduction rate: $r = 0.1$
Repairs are evolved in situ, in real time, without additional test vectors, while allowing the device to remain partially online.
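The threshold arithmetic can be checked with exact fractions, assuming $f = E_C / E_W$ with $E_C = E_W - E_D$ over a window of 600. Comparing $f$ against each threshold then reduces to counting discrepancies; `fitness` and `below` are hypothetical helpers, and which state transition each threshold triggers is not modeled here.

```python
from fractions import Fraction

E_W = 600
theta_S = 1 - Fraction(6, E_W)   # suspect threshold, exactly 99/100
theta_R = 1 - Fraction(4, E_W)   # repair threshold, about 99.33%

def fitness(e_d: int, e_w: int = E_W) -> Fraction:
    # f = E_C / E_W, with E_C = E_W - E_D
    return Fraction(e_w - e_d, e_w)

def below(threshold: Fraction, e_d: int) -> bool:
    """True if a window with e_d discrepancies has fitness below the threshold."""
    return fitness(e_d) < threshold
```

Within a 600-iteration window, $f < \theta_S$ exactly when $E_D > 6$, and $f < \theta_R$ exactly when $E_D > 4$.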

Isolation of a Single Faulty Individual with 1-out-of-64 Impact
- Outliers are identified after $W$ iterations have elapsed.
- E.V. = (1/64) × 600 = 9.375 for the minimum-impact faulty individual.
- The isolated individual's $f$ differs from the average DV by $3\sigma$ after one or more observation intervals of length $W$.