7 July 2008 Sustainable Fault-Handling of Reconfigurable Logic using Throughput-Driven Assessment Carthik Anand Sharma University of Central Florida.

2 Motivation
- Mission-critical embedded systems require high reliability and availability.
- Characteristics of the operating environment may induce hardware failures: aging, manufacturing defects, etc.
- Approaches to system reliability:
  - Fault Avoidance. "Always possible?" No.
  - Design Margin. "Always adequate?" No.
  - Modular Redundancy. "Always recoverable?" No.
  - Fault Refurbishment. "Highly flexible?" Yes, but technically challenging to achieve.

3 Technical Objective: Autonomous FPGA Regeneration
Criterion | Redundancy | Regeneration
Overhead from unutilized spares (weight, size, power) | increases with amount of spare capacity | weakly related to amount of recovery capacity
Granularity of fault coverage (resolution where fault is handled) | restricted at design-time | variable at recovery-time
Fault-resolution latency (availability via downtime required to handle fault) | based on time required to select spare resource | based on time required to find suitable recovery
Quality of repair (likelihood and completeness) | determined by adequacy of spares available (?) | affected by multiple characteristics (+ or -)
Autonomous operation (recover without outside intervention) | yes | yes
- Increased availability without pre-configured spares. Everyday example: a spare tire (redundancy) vs. a can of fix-a-flat (regeneration).
- NASA Moon, Mars, and Beyond: realize 10's of years of service life?
- Reconfiguration allows a new fault-handling paradigm.

4 Fault-Handling Techniques for SRAM-based FPGAs
[Figure: taxonomy of fault-handling approaches for a reprogrammable device, characterized by failure duration, target, detection, isolation, diagnosis, and recovery.]
- Failure duration: transient (SEU); permanent (SEL, oxide breakdown, electron migration, LPD).
- Targets: device configuration; processing datapath.
- Approaches compared: Repetitive Readback [Wells00], TMR (conventional spatial redundancy), BIST, STARS [Abramovici01], CED [McCluskey04], Sussex [Vigander01], and CRR.
- Methods named in the figure: bitwise comparison, invert bit value, majority vote, ignore discrepancy, supplementary testbench, Cartesian intersection, worst-case clock period dilation, replicate in spare resource, duplex output comparison, fast run-time location, select spare resource, population-based GA using extrinsic fitness evaluation, and evolutionary algorithm using intrinsic fitness evaluation. Some cells are marked "(not addressed)" or "unnecessary".

5 Contributions
- Strategy for integrating all phases of the fault-handling process: detection, isolation, diagnosis, and recovery work in synergy.
- Elimination of additional test vectors: enables detection and isolation with minimal system downtime.
- Autonomous Group Testing techniques for FPGA devices: isolate faults in the FPGA while maintaining system performance.
- Competitive Runtime Reconfiguration: leverages iterative pairwise comparison and functional regeneration to provide adaptive refurbishment with resource recycling.

6 Previous Work: Detection Characteristics of FPGA Fault-Handling Schemes
Strategies:
1) Evolve redundancy into the design before an anticipated failure
2) Redesign after detection of a failure
3) Combine desirable aspects of both strategies: 1) + 2) …

7 Group Testing Algorithms
- Origin: World War II blood testing
  - Problem: test samples from millions of new recruits
  - Solution: test blocks of samples before testing individual samples
- Problem definition: identify the subset Q of defectives from the set P
  - Minimize the number of tests
  - Test v-subsets of P
  - Form suitable blocks
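The block-before-individual idea on this slide can be sketched as a two-stage test. This is a minimal illustration, not the algorithm used later in the talk: the function name, the `is_block_defective` oracle, and the fixed block size are all assumptions introduced here.

```python
from typing import Callable, List, Sequence, Set, Tuple


def two_stage_group_test(
    items: Sequence[int],
    is_block_defective: Callable[[Sequence[int]], bool],
    block_size: int,
) -> Tuple[Set[int], int]:
    """Test fixed-size blocks first; retest members only of blocks that fail."""
    defectives: Set[int] = set()
    tests = 0
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        tests += 1  # one pooled test for the whole block
        if not is_block_defective(block):
            continue  # every member of a passing block is cleared at once
        for item in block:
            tests += 1  # individual follow-up test
            if is_block_defective([item]):
                defectives.add(item)
    return defectives, tests


# Hypothetical example: 100 samples, two of which are defective.
truly_defective = {17, 63}
oracle = lambda block: any(i in truly_defective for i in block)
found, tests_used = two_stage_group_test(list(range(100)), oracle, block_size=10)
```

With two defectives among 100 samples and blocks of 10, the sketch needs 30 tests rather than 100, which is the saving that motivates forming suitable blocks.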

8 CRR Arrangement in SRAM FPGA
- Configurations in population: $C = C_L \cup C_R$
  - $C_L$ = subset of left-half configurations
  - $C_R$ = subset of right-half configurations
  - $|C_L| = |C_R| = |C|/2$
- Baseline Discrepancy Operator: a dyadic operator with binary output
  - $Z(C_i)$ is the FPGA data throughput output of configuration $C_i$
  - RS variant: Hamming distance
  - WTA variant: equivalence
- Each half-configuration evaluates the discrepancy operator using an embedded checker (XNOR gate) within each individual
- Any fault in the checker lowers that individual's fitness, so the individual is no longer preferred and eventually undergoes repair
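The two discrepancy measures named on this slide (Hamming distance and XNOR-style equivalence) can be sketched on integer-encoded outputs. The function names, the integer encoding of $Z(C_i)$, and the 6-bit width used in the example are assumptions for illustration only.

```python
def hamming_distance(z_left: int, z_right: int) -> int:
    """Number of bit positions in which the two half-configuration outputs differ."""
    return bin(z_left ^ z_right).count("1")


def equivalent(z_left: int, z_right: int, width: int) -> int:
    """Bitwise XNOR reduced to one bit: 1 if all `width` bits agree, else 0."""
    mask = (1 << width) - 1
    xnor = ~(z_left ^ z_right) & mask
    return 1 if xnor == mask else 0


# Hypothetical 6-bit outputs of a left-half and a right-half configuration.
distance = hamming_distance(0b101101, 0b100111)
agree = equivalent(0b010101, 0b010101, width=6)
disagree = equivalent(0b010101, 0b010100, width=6)
```

The equivalence form yields the binary output the slide describes, while the Hamming form additionally grades how far apart the two outputs are.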

9 Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity; fitness assessment via pairwise discrepancy (temporal voting vs. spatial voting).
1. Initialization
  - Population P of functionally-identical yet physically-distinct configurations
  - Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2 to designate physical FPGA left-half or right-half resource utilization
2. Fitness Assessment
  - The Discrepancy Operator is some function of bitwise agreement between each half's output
  - Four fitness states are defined for configurations, {C_P, C_S, C_U, C_R}, with transitions between them; respectively: Pristine, Suspect, Under Repair, Refurbished
  - The Fitness Evaluation Window W determines the comparison interval
3. Regeneration
  - Genetic operators are used to recover from a fault, based on the Reintroduction Rate
  - Operators are applied only once; offspring are then returned to "service" without concern for increasing fitness

10 FPGA Genetic Representations
Chromosome goals:
- Allow all possible LUT configurations
- Allow all possible CLB interconnections given the constraints of routing support
- Disallow illegal FPGA configurations and non-coding introns (junk DNA)
- Facilitate the crossover operator
A bitstring representation is the natural choice, though it may not scale well (generative representations are under investigation).
The representation shown is a sample specific to the Xilinx Virtex FPGA: CLB 0, CLB 1, …, CLB n, each containing LUT 0, LUT 1, LUT 2, and LUT 3.

11 Competitive Runtime Reconfiguration (CRR): Conceptual Innovation
Evolutionary Computation strategies are effective for more than just the repair phase: continually detect, rank, and isolate faults entirely within the underlying data throughput flow.
- Novel fitness assessment via pairwise discrepancy, without any pre-conceived oracle for correctness (emergent behavior)
- No test vectors
- Diverse alternatives working a priori
- Fault detection by robust consensus over time
- Device remains online during repair
- No reconfiguration when fault-free
- Fault isolation is model-free and self-calibrating
- Completely-repaired criteria can be ignored
- Performance readily adjustable
- Graceful degradation via ranking of alternatives
- Checking logic is part of the individual, hence also competes for correctness
- Failures in population memory are covered

12 Fitness Evaluation Window
- The Fitness Evaluation Window $E$ denotes the number of iterations used to evaluate fitness before the state of an individual is determined.
- Determination of $E$ for the 3x3 multiplier:
  - 6 input pins articulate $2^6 = 64$ possible inputs
  - $W$ should be selected so that all possible inputs appear
- More formally, let $\mathrm{rand}(X)$ return some $x_i \in X$ at random. Seek $W$ such that $\bigcup_{i=1}^{W} \mathrm{rand}(X) = X$ with high probability.
- Let $x_K$ be the number of distinct orderings of $K$ inputs showing in $D$ trials. If $D$ is constant, $P_{k>1}$ can be calculated successively.
- The probability $P_K$ of $K$ inputs showing after $D$ trials is the ratio $P_K = x_K / K^D$.

13 When K=64: E Determination
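The ratio $x_K / K^D$ from the Fitness Evaluation Window slide can be evaluated exactly. The sketch below assumes inputs are drawn uniformly and independently, and counts $x_K$ by inclusion-exclusion; that counting method and the function name are assumptions made here, since the transcript does not show how the original curve was computed.

```python
from fractions import Fraction
from math import comb


def prob_all_inputs_seen(k: int, d: int) -> float:
    """Probability that all k equiprobable inputs appear at least once in d trials."""
    # x_k counts the length-d input sequences in which every one of the k inputs
    # occurs (inclusion-exclusion over the inputs that are missing).
    x_k = sum((-1) ** j * comb(k, j) * (k - j) ** d for j in range(k + 1))
    # Exact integer arithmetic avoids the cancellation errors floats would suffer.
    return float(Fraction(x_k, k ** d))


p_window_600 = prob_all_inputs_seen(64, 600)
```

For K = 64 and a window of 600 iterations (the value used in the regeneration experiments), this evaluates to roughly 0.995, so a window of that size almost always exercises every input of the 3x3 multiplier under the uniform-input assumption.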

14 Integer Multiplier Case Study
3-bit x 3-bit unsigned multiplier, automated design:
- Building blocks
  - Half-Adder: 18 templates created
  - Full-Adder: 24 templates
  - Parallel-And: 1 template created
- Randomly select templates for instantiation in modules
GA operators: External-Module-Crossover, Internal-Module-Crossover, Internal-Module-Mutation
GA parameters:
- Population size: 20 individuals
- Crossover rate: 5%
- Mutation rate: up to 80% per bit
Experimental evaluation: Xilinx Virtex II Pro on Avnet PCI board
Experiments demonstrate:
- Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
- Elimination of additional test vectors
- Temporal assessment process

15 Regeneration Performance: System Throughput during Regeneration for a 3x3 multiplier
Parameters (Difference vs. Hamming Distance):
- Evaluation Window: $E_w = 600$
- Suspect Threshold: 1 - 6/600 = 99%
- Repair Threshold: 1 - 4/600 = 99.3%
- Re-introduction rate: r = 0.1
Exp. Number | Fault Location | Failure Type | Correctness after Fault | Total Iterations | Discrepant Iterations | Repair Iterations | Final Correctness | Throughput (%)
1 | CLB3, LUT0, Input1 | Stuck-at-1 | 52 / 64 | 1.7 × 10^7 | 4.2 × 10^5 | 1194 | 64 / 64 | 97.7
2 | CLB6, LUT0, Input1 | Stuck-at-0 | 33 / 64 | 8.0 × 10^5 | 1.7 × 10^4 | 47 | 64 / 64 | 97.9
3 | CLB5, LUT2, Input0 | Stuck-at-1 | 22 / 64 | 3.1 × 10^6 | 6.8 × 10^4 | 193 | 64 / 64 | 97.8
4 | CLB7, LUT2, Input0 | Stuck-at-0 | 38 / 64 | 8.1 × 10^6 | 1.8 × 10^5 | 513 | 64 / 64 | 97.7
5 | CLB9, LUT0, Input1 | Stuck-at-0 | 40 / 64 | 2.3 × 10^6 | 7.1 × 10^4 | 219 | 64 / 64 | 96.9
Average | | | 32.6 / 64 | 6.4 × 10^6 | 1.5 × 10^5 | 433 | 64 / 64 | 97.6
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing the device to remain partially online.

16 Isolation Problem Outline
Objectives
- Locate the faulty logic and/or interconnect resource: a single stuck-at fault model is assumed
- Online fault isolation: the device is not entirely removed from service
Features
- Runtime reconfiguration: FPGA resources are configured dynamically
- Utilize runtime inputs: avoid special test vectors, improve availability
Constraints
- Use pre-designed configurations: defined by the target application
- Subsets under test have a constant resource utilization range for a given isolation problem
- Resource grouping influences fault articulation: resource mapping and input vector might mask hardware faults
- Do not use specialized "block designs"
- Runtime reconfiguration initially limited to column-swapping
- "Non-reasonable" algorithm: "tests" may be repeated without gaining new isolation information

17 Discrepancy Mirror Fault Coverage Mechanism for Checking-the-Checker (“golden element” problem) Makes checker part of configuration that competes [DeMara PDPTA-05]

18 Influence of LUT Utilization
[Figures: Perpetually Articulating Inputs with Equiprobable Distribution; Intermittently Articulating Inputs with Equiprobable Distribution]
- The expected number of pairings grows sub-linearly in the number of resources.
- Utilization below 20% or above 80% implicates (or exonerates) a smaller subset of resources.
- At 50% utilization, the expected number of pairings for 1,000, 10,000, and 100,000 resources is 11.1, 14.9, and 17.6, respectively.
- At 90% utilization, a mean of 258 pairings is required to isolate the faulty resource.

19 Fault Location Using Dueling
- The set of all competing configurations is represented by S. Set $C_k$ represents the resources utilized by configuration k.
- Usage Matrix: each competing configuration k, $1 \le k \le |S|$, has a unique binary Usage Matrix $U_k$, $1 \le k \le p$, with elements $U_k[i,j]$, $1 \le i \le m$, $1 \le j \le n$, where m and n are the number of rows and columns in the device layout, respectively. $U_k[i,j] = 1$ denotes the usage of resource (i, j) by $C_k$.
- History Matrix: the History Matrix H, with elements $H[i,j]$, $1 \le i \le m$, $1 \le j \le n$, is an integer matrix used to represent the relative fitness of individual resources. $H[i,j]$ provides instantaneous relative fitness values of resources.

20 Dueling Example
H[i,j] @ t = 0:
0000000000
0000000000
0000000000
0000000000
0000000000
0000000000
0000000000
0000000000
0000000000
0000000000
U_1:
0000000000
0000100000
0010000000
0000010100
0001000000
0010011000
0000100000
0010000100
0000000000
0000000000
U_2:
0000000000
0001011000
0011001000
0010100000
0010010000
0000000000
0000000000
0000000000
0000000000
0000000000
H[i,j] @ t = 2:
0000000000
0001111000
0021001000
0010110100
0011010000
0010011000
0000100000
0010000100
0000000000
0000000000
- H[i,j] changes after C_1 and C_2 are loaded
- U_1 and U_2 are the corresponding Usage Matrices
- (3,3) is identified as the faulty resource
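The dueling example can be replayed with the matrices from the slide. The sketch assumes that both C_1 and C_2 produced discrepant outputs, so each one increments H at every resource it uses; this matches the slide, where H at t = 2 equals U_1 + U_2. Function names are illustrative.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def parse(rows: List[str]) -> Matrix:
    """Turn rows of digit characters into an integer matrix."""
    return [[int(ch) for ch in row] for row in rows]


U1 = parse(["0000000000", "0000100000", "0010000000", "0000010100", "0001000000",
            "0010011000", "0000100000", "0010000100", "0000000000", "0000000000"])
U2 = parse(["0000000000", "0001011000", "0011001000", "0010100000", "0010010000",
            "0000000000", "0000000000", "0000000000", "0000000000", "0000000000"])


def record_discrepancy(history: Matrix, usage: Matrix) -> None:
    """Increment H[i][j] for every resource used by a discrepant configuration."""
    for i, row in enumerate(usage):
        for j, used in enumerate(row):
            history[i][j] += used


def most_suspect(history: Matrix) -> List[Tuple[int, int]]:
    """Return 1-indexed (row, column) positions holding the largest count in H."""
    peak = max(max(row) for row in history)
    return [(i + 1, j + 1) for i, row in enumerate(history)
            for j, value in enumerate(row) if value == peak]


H = [[0] * 10 for _ in range(10)]  # H at t = 0 is all zero
for usage in (U1, U2):
    record_discrepancy(H, usage)
suspects = most_suspect(H)
```

Resource (3,3) is the only one used by both discrepant configurations, so it alone reaches a count of 2.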

21 Isolation Progress without Halving
- Initially |S| = 20,000; resource utilization = 40%
- The number of suspected faulty elements stays constant at 36 after 23 iterations
- No subsequent improvement, due to a lack of differentiating information between competing configurations
- Temporary stasis in isolation due to insufficient design diversity

22 Dueling with Modified Halving
- Halving works by swapping half of the used columns with unused ones
- Halving progressively reduces the size of the set of suspected faulty elements
- Isolation proceeds until a single faulty element is isolated
- Fault isolated after 19 iterations
- Symptoms of stasis invoke the halving procedure for fast isolation
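Why halving shrinks the suspect set can be seen in a simplified column-level model. This is only a sketch under strong assumptions that are not from the slides: a single faulty column, a hypothetical oracle `is_discrepant` that reports whether a configuration using a given set of columns produces discrepant output, and enough spare columns to host the half that is swapped out.

```python
from typing import Callable, List, Set, Tuple


def isolate_by_halving(
    suspect_columns: List[int],
    is_discrepant: Callable[[Set[int]], bool],
) -> Tuple[int, int]:
    """Repeatedly keep half of the suspect columns in use and swap out the rest."""
    suspects = list(suspect_columns)
    rounds = 0
    while len(suspects) > 1:
        half = len(suspects) // 2
        kept, swapped_out = suspects[:half], suspects[half:]
        rounds += 1
        # If the output is still discrepant, the fault lies in a kept column;
        # otherwise it must lie in one of the columns that were swapped out.
        suspects = kept if is_discrepant(set(kept)) else swapped_out
    return suspects[0], rounds


# Hypothetical device with 16 suspect columns, of which column 11 is faulty.
faulty_column = 11
oracle = lambda used: faulty_column in used
column, rounds = isolate_by_halving(list(range(16)), oracle)
```

In this model each round halves the suspect set, so 16 columns are resolved in 4 rounds; the iteration counts reported on the slide come from the actual resource-level experiment and are not reproduced by this toy model.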

23 Enhancing Embedded Core BIST using Group Testing
BIST structure used for embedded core testing:
- XCVLX30 device: 32 DSP48E cores divided into n = 8 groups
- 8 x 6 2x1 multiplexers are needed
- 6 columns of comparators; each column has 8 comparators
- Comparators $k_n(i,j)$, $0 \le i,j \le 3$, $i \ne j$, complete the test for a group of 4
- Flip-flops FF_0 through FF_5 register the comparison results for each group
- A fault diagnosis script processes the result of each set of 6 outputs

24 Embedded Core BIST using Group Testing – Resource Utilization Faults in up to 2 BUTs in each group of 4 can be isolated Isolation is achieved without device reconfiguration in a single stage
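The six pairwise comparisons per group of four can be turned into a diagnosis with a simple rule. The sketch below is not the authors' diagnosis script; it assumes at most two faulty blocks under test (BUTs) per group, and that a faulty block's output differs from the fault-free output and from any other faulty block's output. Names are illustrative.

```python
from itertools import combinations
from typing import Dict, Sequence, Set, Tuple


def compare_group(outputs: Sequence[int]) -> Dict[Tuple[int, int], bool]:
    """Six comparator results for a group of 4 blocks; True means mismatch."""
    return {(i, j): outputs[i] != outputs[j] for i, j in combinations(range(4), 2)}


def diagnose(mismatch: Dict[Tuple[int, int], bool]) -> Set[int]:
    """A block that agrees with no other block in its group is declared faulty."""
    faulty = set()
    for unit in range(4):
        agrees_with_someone = any(
            not bad for pair, bad in mismatch.items() if unit in pair
        )
        if not agrees_with_someone:
            faulty.add(unit)
    return faulty


one_fault = diagnose(compare_group([5, 5, 7, 5]))
two_faults = diagnose(compare_group([5, 9, 7, 5]))
no_fault = diagnose(compare_group([5, 5, 5, 5]))
```

With two or more fault-free blocks in the group, the fault-free blocks always agree with each other, which is why up to two faulty blocks can be singled out from one set of six comparator outputs.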

25 Logic Element Isolation Using Autonomous Group Testing (AGT)
- In each stage, the suspect resources S are equally shared among the $p_{stage}$ individuals
- If $S = S_{max}$, then mutually exclusive shares are possible; otherwise, $n_{share} = n_{reqd} - |R| - |S|$ resources are shared

26 Equal Share Strategy

27 Fault Isolation Using FIAT
The Fault Insertion and Analysis Toolkit (FIAT):
- provides methods to modify Xilinx FPGA configurations
- inserts stuck-at faults at LUT inputs
- precludes the need to edit the configuration bitstream
- works in conjunction with Xilinx ISE software (COTS design suite)

28 AGT Experiments
Experimental setup:
- DES-56 encryption circuit
- Xilinx ISE design tools to place and route the design
- Virtex II Pro FPGA device
- Fault Injection and Analysis Toolkit (FIAT)
  - Application Programmer Interfaces (APIs) to interact with the Xilinx ISE tools to inject and evaluate faults
  - Editing the design file rather than the configuration bitstreams to introduce stuck-at faults
  - Editing User Constraint Files (UCF) to control resource usage

29 AGT – Isolation Progress

30 AGT – Maintaining Goodput
- With $p_{preset} = 5$, goodput is maintained at > 90%
- Since goodput remains high, the rate of fault isolation is slower, with better-performing individuals selected to maintain goodput
- Fault detection latency is minimal compared to STARS; isolation is achieved with manageable degradation of system performance

31 Conclusion
- Graceful Performance Degradation
  - elimination of additional test vectors
  - temporal assessment using aging and outlier detection
  - resource recycling to utilize residual functionality
- Population-Centric Assessment
  - provides adaptability and self-calibrating autonomy with a relative assessment method
  - fitness assessment using population information and competition
  - creates a fully functional solution using partially-fit individuals
- Autonomous Group Testing
  - minimal-latency fault detection
  - fault isolation without additional test vectors
  - efficient strategies for fast fault isolation with minimal reconfiguration
  - fast first-responder to faults via resource tracking
- Run-time Fault Management
  - can be realized using consensus-driven assessment methods and information contained in the population
  - integrates detection, isolation, and repair under a single population-based technique

32 Future Work
- Evolvable Sequential Logic Circuits
  - Fitness assessment is a major challenge for large circuits
- Logic and Interconnect Fault Handling
  - Need to integrate fault-handling methods for faults in logic and in the interconnects
  - Extend group testing principles to interconnect faults
- Challenges in Partial Reconfiguration
  - Need well-tested and supported APIs for runtime reconfiguration of commercial FPGAs
  - Open standards in partial reconfiguration will assist reliability studies
  - Decreased dependence on vendor-provided design tools, with an open bitstream structure, is essential
  - FIAT can be used to study the fault isolation properties of different approaches, and to evaluate other group testing algorithms for fault isolation
- Extending AGT to Other Domains
  - The group testing techniques presented here can be adapted for fault-tolerant nano-scale mechanisms, software, etc.
  - Reliable, self-monitoring, self-adaptive organic systems are needed as design complexity and computational capabilities increase

33 Publications
- Michael Georgiopoulos, Ronald F. DeMara, Avelino J. Gonzalez, Annie S. Wu, Mansooreh Mollaghasemi, Erol Gelenbe, Marcella Kysilka, Jimmy Secretan, Carthik A. Sharma and Ayman J. Alnsour, “A Sustainable Model for Integrating Current Topics in Machine Learning Research into the Undergraduate Curriculum,” accepted to the IEEE Transactions in Education, July 2008.
- A. Sarvi, C. A. Sharma and R. F. DeMara, “BIST-Based Group Testing for Diagnosis of Embedded FPGA Cores,” accepted to The 2008 International Conference on Embedded Systems and Applications, Las Vegas, Nevada, USA (July 14-17, 2008).
- C. A. Sharma, R. F. DeMara and A. Sarvi, “Self-Healing Reconfigurable Logic using Autonomous Group Testing,” submitted to ACM Transactions on Autonomous and Adaptive Systems (TAAS), Special Issue on Organic Computing, May 2007.
- R. F. DeMara, K. Zhang, C. A. Sharma, “Consensus-based Evolvable Hardware for Sustainable Fault Handling,” submitted to The IEEE Transactions in Evolutionary Computation, Aug 2007.
- R. N. Al-Haddad, C. A. Sharma, R. F. DeMara, “Performance Evaluation of Two Allocation Schemes for Combinatorial Group Testing Fault Isolation,” in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA ‘07), Las Vegas, Nevada, U.S.A, June 25 – 28, 2007.
- R. S. Oreifej, C. A. Sharma, R. F. DeMara, “Expediting GA-Based Evolution Using Group Testing Techniques for Reconfigurable Hardware,” in Proceedings of the IEEE International Conference on Reconfigurable Computing and FPGAs (Reconfig’06), San Luis Potosi, Mexico, September 20-22, 2006, pp 106-113.
- C. A. Sharma, R. F. DeMara, “A Combinatorial Group Testing Method for FPGA Fault Location,” in Proceedings of the International Conference on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico, January 23 - 35, 2006.
- C. J. Milliord, C. A. Sharma, R. F. DeMara, “Dynamic Voting Schemes to Enhance Evolutionary Repair in Reconfigurable Logic Devices,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig’05), pp. 8.1.1 - 8.1.6, Puebla City, Mexico, September 28 - 30, 2005.
- K. Zhang, R. F. DeMara, C. A. Sharma, “Consensus-based Evaluation for Fault Isolation and On-line Evolutionary Regeneration,” in Proceedings of the International Conference in Evolvable Systems (ICES’05), pp. 12 - 24, Barcelona, Spain, September 12 - 14, 2005.
- R. F. DeMara and C. A. Sharma, “Self-Checking Fault Detection using Discrepancy Mirrors,” in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’05), pp. 311-317, Las Vegas, Nevada, U.S.A, June 27 – 30, 2005.

34 Backup Slides On following pages …

35 Isolation: Block Duelling
- Algorithm based on group testing methods
- Successive intersection to assess the health of resources
- Each configuration k has a binary Usage Matrix $U_k[i,j]$, $1 \le i \le m$ and $1 \le j \le n$
  - m, n are the number of rows and columns of resources in the device
  - Elements $U_k[i,j] = 1$ are resources used in k
- A History Matrix $H[i,j]$, $1 \le i \le m$ and $1 \le j \le n$, initially all zero, exists in which:
  - entries represent the fitness of resources (i, j)
  - information regarding the fitness of resources over time is stored
- A discrepant output leads to an increase in the value of $H[i,j]$, $\forall\, U_k[i,j] = 1$, $k \in S$
  - All elements of H corresponding to resources used by a discrepant configuration are incremented by one
  - At any point in time, H[i,j] is a record of the outcomes of competitions
  - Successive intersections are performed until |S| = 1

36 Isolation of a single faulty individual with 1-out-of-64 impact
- Outliers are identified after W iterations have elapsed
- E.V. = (1/64)*600 = 9.375 discrepancies from the minimum-impact faulty individual
- The isolated individual's f differs from the average DV by $3\sigma$ after 1 or more observation intervals of length W

37 Isolation of a single faulty L individual with 10-out-of-64 impact
Compare with the 1-out-of-64 fault impact:
- E.V. of (10/64)*600 = 93.75 discrepancies for the faulty configuration
- One isolation will be complete approximately once in every 93.75/5 ≈ 19 Observation Intervals
- Fault isolation demonstrated in 100% of cases
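The expected values quoted on the two isolation slides follow from the fraction of inputs that articulate the fault and the window length W = 600. The sketch below reproduces them and, as an added assumption not stated in the slides, models each iteration as an independent draw from 64 equiprobable inputs so that a spread can be estimated as well.

```python
from math import sqrt


def expected_discrepancies(affected_inputs: int, total_inputs: int = 64,
                           window: int = 600) -> float:
    """Expected discrepant iterations per window for a fault hit by some inputs."""
    return window * affected_inputs / total_inputs


def discrepancy_std(affected_inputs: int, total_inputs: int = 64,
                    window: int = 600) -> float:
    """Standard deviation under the assumed binomial (independent-input) model."""
    p = affected_inputs / total_inputs
    return sqrt(window * p * (1 - p))


ev_min_impact = expected_discrepancies(1)    # 1-out-of-64 impact
ev_high_impact = expected_discrepancies(10)  # 10-out-of-64 impact
std_min_impact = discrepancy_std(1)
```

Under that model a 1-out-of-64 fault produces only about 9 discrepancies per window with a spread of about 3, which is consistent with the slides' observation that low-impact faults may need more than one observation interval before the individual stands out as an outlier.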

38 Isolation of 8 faulty individuals L4&R4 with 1-out-of-64 impact
Expected isolations do not occur approximately 40% of the time:
- The average discrepancy value of the population is higher
- Outlier isolation is difficult
- Multiple faulty individuals; discrepancies are scattered

39 Online Dueling Evaluation
Objective
- Isolate faults by successive intersection between sets of FPGA resources used by configurations
- Analyze the complexity of the isolation process
Variables
- Total resources available: measured in number of LUTs
- Number of competing configurations: number of initial "seed" designs in the CRR process
- Degree of articulation: some inputs may not manifest faults, even if a faulty resource is used by the individual
- Resource utilization factor: percentage of FPGA resources required by the target application/design
- Number of iterations for isolation: measure of the complexity and time involved in isolating a fault

40 For further info … EH Website http://cal.ucf.edu

41 Fast Reconfiguration for Autonomously Reprogrammable Logic
- Motivation
  - Dynamic reconfiguration required by application
  - Exploit architectural & performance improvements fully
  - Reconfiguration delay: a major performance barrier
- Previous Work
- Methodology
  - Multilayer Runtime Reconfiguration Architecture (MRRA)
  - Spatial Management
- Prototype Development
  - Loosely-Coupled solution
  - Timing Analysis
  - System-On-Chip solution

42 Reconfiguration Demand during CRR
For a complete repair:
- Approximately 2,000 generations may be required
- For each generation, up to 100 evaluations may be needed
- Together these yield the Cumulative Number of Reconfigurations (CNR)
- For each reconfiguration task, even if the reconfiguration delay alone is assumed to be on the order of tens or hundreds of milliseconds, the total delay is $L_{tot} \ge 5.5$ hours
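The 5.5-hour figure can be checked with back-of-the-envelope arithmetic. The sketch assumes one reconfiguration per evaluation and a delay of 100 ms per reconfiguration (the upper end of "tens or hundreds of milliseconds"); both are assumptions chosen here because they reproduce the slide's total, not values stated in the transcript.

```python
GENERATIONS = 2_000             # generations for a complete repair (from the slide)
EVALUATIONS_PER_GENERATION = 100  # upper bound per generation (from the slide)

# Assumption: every evaluation requires exactly one reconfiguration.
cumulative_reconfigurations = GENERATIONS * EVALUATIONS_PER_GENERATION


def total_delay_hours(delay_ms: float) -> float:
    """Total reconfiguration delay in hours for a given per-task delay."""
    return cumulative_reconfigurations * delay_ms / 1000 / 3600


delay_at_100_ms = total_delay_hours(100)
delay_at_10_ms = total_delay_hours(10)
```

At 100 ms per task the 200,000 reconfigurations take about 5.6 hours, while at 10 ms they take about 33 minutes, which illustrates why reconfiguration delay is treated as the main performance barrier.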

43 Previous Work - Algorithm Level
Attribute values are listed in the column order Partial Reconfig / Spatial Relocation / Temporal Parallelism / Area shape / Run-Time, as extracted from the slide; rows with fewer values had merged cells. Rows are grouped as compression, temporal, and spatial methods.
Approach | Method | Attribute values | Potential Limitations
Hauck, Li, Schwabe | Bit file compression | N/A / No / N/A / No | Full reconfiguration required
Shirazi, Luk, Cheung | Identifying common components | Yes / No / Yes / N/A / No | Design time work required
Mak, Young | Dynamic Partitioning | Yes / No / Yes / N/A / Yes | Only desirable for large designs
Ganesan, Vemuri | Pipelining | Yes / No / Yes / N/A / Yes | Limited pipeline depth
Compton, Li, Knol, Hauck | Relocation and Defragmentation with new FPGA architecture | Yes / No / Row-based / Yes | Special FPGA architecture required
Diessel, Middendorf, Schmeck, Schmidt | Task Remapped and Relocated | Yes / No / Rectangle / Yes | Overhead for remapping calculations
Herbert, Christoph, Macro | Partitioning and 2D Hashing | Yes / Rectangle / Yes | Rigid task modeling assumptions

44 Multilayer Runtime Reconfiguration Architecture (MRRA)
- Develop the MRRA fast reconfiguration paradigm for the CRR approach
- Validate with a real hardware platform along with detailed performance analysis
- First general-purpose framework for a wide variety of applications requiring dynamic reconfiguration
- Extend existing theories on reconfiguration

45 Loosely Coupled Solution The entire system operates on a 32-bit basis The Virtex-II Pro is mounted on a development board which can then be interfaced with a WorkStation running Xilinx EDK and ISE.

46 Result Assessment
- Establish a fully functional framework for both prototypes
- Communication overhead, throughput, and overall speed-up analysis
  - Communication overhead for the SOC solution is decreased to the microsecond or sub-microsecond order, vs. the millisecond order of the Loosely Coupled solution
  - Up to 5-fold speedup is expected compared to the Loosely Coupled solution
- Translation complexity analysis
  - The quantity of information that needs to be translated to generate the reconfiguration bitstream
  - Simplification from file level to bit level is expected
- Storage complexity analysis
  - The memory space required for the run-time algorithms
  - A decreased memory requirement is expected due to the translation complexity improvement

47 Publications
Accepted Manuscripts
1. R. F. DeMara and K. Zhang, “Autonomous FPGA Fault Handling through Competitive Runtime Reconfiguration,” to appear in NASA/DoD Conference on Evolvable Hardware (EH’05), Washington D.C., U.S.A., June 29 – July 1, 2005.
2. H. Tan and R. F. DeMara, “A Device-Controlled Dynamic Configuration Framework Supporting Heterogeneous Resource Management,” to appear in International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA’05), Las Vegas, Nevada, U.S.A, June 27 – 30, 2005.
3. R. F. DeMara and C. A. Sharma, “Self-Checking Fault Detection using Discrepancy Mirrors,” to appear in International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’05), Las Vegas, Nevada, U.S.A, June 27 – 30, 2005.
Submitted Manuscripts
1. R. F. DeMara and K. Zhang, “Populational Fault Tolerance Analysis Under CRR Approach,” submitted to International Conference on Evolvable Systems (ICES’05), Barcelona, Sept. 12 – 14, 2005.
2. R. F. DeMara and C. A. Sharma, “FPGA Fault Isolation and Refurbishment using Iterative Pairing,” submitted to IFIP VLSI-SOC Conference, Perth, W. Australia, October 17 – 19, 2005.
Manuscripts In Preparation
1. R. F. DeMara and K. Zhang, “Autonomous Fault Occlusion through Competitive Runtime Reconfiguration,” submission planned to IEEE Transactions on Evolutionary Computation.
2. R. F. DeMara and C. A. Sharma, “Multilayer Dynamic Reconfiguration Supporting Heterogeneous FPGA Resource Management,” submission planned to IEEE Design and Test of Computers.
Field Testing
- Implementation of CRR on-board an SRAM-based FPGA in a Cubesat mission

48 EHW Environments
- Evolvable Hardware (EHW) environments enable experimental methods to research soft-computing intelligent search techniques.
- EHW operates by repetitive reprogramming of real-world physical devices using an iterative refinement process.
- Two modes of Evolvable Hardware:
  - Extrinsic Evolution: Genetic Algorithm with a software model; simulation in the loop; device "design-time" refinement (Done? Build it)
  - Intrinsic Evolution: Genetic Algorithm with hardware in the loop; device "run-time" refinement; a new approach to autonomous repair of failed devices
- Application: the Stardust satellite, with >100 FPGAs onboard in a hostile environment (radiation, thermal stress). How to achieve reliability to avoid mission failure?

49 Genetic Algorithms (GAs)
Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics).
[Figure: GA cycle. Start with a population of candidate solutions; evaluate the fitness of individuals with the fitness function; select parents; apply crossover and mutation to produce offspring; replace individuals in the population; repeat until the goal is reached.]

50 Genetic Mechanisms
- Guided trial-and-error search techniques using principles of Darwinian evolution
  - iterative selection, "survival of the fittest"
  - genetic operators: mutation, crossover, …
  - implementor must define the fitness function
- GAs frequently use strings of 1s and 0s to represent candidate solutions
  - if 100101 is better than 010001, it will have more chance to breed and influence the future population
- GAs "cast a net" over the entire solution space to find regions of high fitness
- Can invoke the Elitism Operator (E=1, E=2, …)
  - guarantees that the fitness of the best individual never decreases from one generation to the next

51 GA Success Stories
Commercial applications:
- Nextel: frequency allocation for cellular phone networks; $15M predicted savings in the NY market
- Pratt & Whitney: turbine engine design; engineer: 8 weeks, GA: 2 days with 3x improvement
- International Truck: production scheduling improved by 90% in 5 plants
- NASA: superior Jupiter trajectory optimization, antennas, FPGAs
- Koza: 25 instances showing human-competitive performance, such as analog circuit design, amplifiers, filters

52 Representing Candidate Solutions
- An individual (chromosome) is made up of genes
- An individual can be represented using discrete values (binary, integer, or any other system with a discrete set of values)
- Example of binary DNA encoding:

53 Genetic Operators t t + 1t + 1 mutation recombination (crossover) reproduction selection

54 Crossover Operator
Population: ...
Parents:   1 1 1 1 1 1 1    0 0 0 0 0 0 0
Cut after the third bit
Offspring: 1 1 1 0 0 0 0    0 0 0 1 1 1 1
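The crossover shown on this slide is a single-point crossover on bitstrings. A minimal sketch, using the slide's own parents and a cut after the third bit (the function name is illustrative):

```python
from typing import Tuple


def single_point_crossover(parent_a: str, parent_b: str, cut: int) -> Tuple[str, str]:
    """Swap the tails of two equal-length bitstrings after position `cut`."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


child_a, child_b = single_point_crossover("1111111", "0000000", cut=3)
```

Each offspring keeps the head of one parent and the tail of the other, so building blocks on either side of the cut are preserved intact.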

55 Procedural Flow under Competitive Runtime Reconfiguration
- Integrates all fault-handling stages using an EC strategy
  - Detects faults by the occurrence of discrepancy
  - Isolates faults by accumulation of discrepancies
  - Failure-specific refurbishment using genetic operators: Intra-Module-Crossover, Inter-Module-Crossover, Intra-Module-Mutation
- Realizes online device refurbishment
  - Refurbished online without additional functional or resource test vectors
  - Repair during the normal data throughput process

56 Template Fault Coverage
Half-Adder Template A:
- Gate3 is an AND gate
- Will lose correctness if a stuck-at-zero fault occurs in the second input line of Gate3
Half-Adder Template B:
- Gate3 is a NOT gate and uses only the first input line
- Will work correctly even if the second input line is stuck at zero or one

57 Evolvable Hardware
Evolutionary Design:
- Start with available CLBs and IOBs
- Implement a design using genetic operators, etc.
- Limited or no ability to re-design to account for suspected faulty resources
Evolutionary Regeneration:
- Start with an existing pool of designs
- Some existing configurations may use faulty resources
- Eliminate use of suspected faulty resources
- Genetic operators can be applied to refurbish designs

58 Competitive Runtime Reconfiguration (CRR) Overview
- Uses a relative fitness measure
  - Pairwise discrepancy checking yields a relative fitness measure
  - Broad temporal consensus in the population is used to determine the fitness metric
  - Transitions between fitness states occur in the population
  - Provides graceful degradation in the presence of changing environments, applications, and inputs, since this is a moving measure
- Test inputs = normal inputs for data throughput
  - CBE uses neither additional functional test vectors nor resource test vectors
  - Potential for higher availability, as regeneration is integrated with normal operation

59 Exploiting Population Information
- The population contains more robust information than individuals
  - Utilize this information for robust fault detection, faster regeneration, and increased diversity for adaptation
- Detect failure and isolate faulty resources
  - Detect by inconsistencies among the population
  - Isolate faults using outlier identification and aging
- Realize regeneration
  - Recovery Complexity << Design Complexity: utilize diverse raw material during regeneration vs. isolated re-design
  - Temporal consensus directs the search
- Adaptable performance based on online inputs
  - The population evolves to the changing physical environment, input vectors, and target application while increasing availability

60 Selection Process

61 Fitness Adjustment Procedure

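62 Discrepancy Mirror Circuit Fault Coverage
Component | Fault scenario 1 | Fault scenario 2 | Fault scenario 3 | Fault scenario 4 | Fault-free
Function Output A | Fault | Correct | Correct | Correct | Correct
Function Output B | Correct | Fault | Correct | Correct | Correct
XNOR A | Disagree (0) | Disagree (0) | Fault: Disagree (0) | Agree (1) | Agree (1)
XNOR B | Disagree (0) | Disagree (0) | Agree (1) | Fault: Disagree (0) | Agree (1)
Buffer A | 0 | 0 | High-Z | 0 | 1
Buffer B | 0 | 0 | 0 | High-Z | 1
Match Output | 0 | 0 | 0 | 0 | 1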

63 CGT-Pruned GA Simulator

64 Repair Progress
