12-14 September 2005 Consensus-based Evaluation for Fault Isolation and On-line Evolutionary Regeneration K. Zhang, R. F. DeMara, and C. A. Sharma University of Central Florida


1 12-14 September 2005 Consensus-based Evaluation for Fault Isolation and On-line Evolutionary Regeneration K. Zhang, R. F. DeMara, and C. A. Sharma University of Central Florida

2 Technical Objective: Autonomous FPGA Regeneration
Comparison of Redundancy and Regeneration by characteristic:
- Overhead from Unutilized Spares (weight, size, power)
  Redundancy: increases with amount of spare capacity
  Regeneration: weakly-related to number recovery capacity
- Granularity of Fault Coverage (resolution where fault handled)
  Redundancy: restricted at design-time
  Regeneration: variable at recovery-time
- Fault-Resolution Latency (availability via downtime required to handle fault)
  Redundancy: based on time required to select spare resource
  Regeneration: based on time required to find suitable recovery
- Quality of Repair (likelihood and completeness)
  Redundancy: determined by adequacy of spares available (?)
  Regeneration: affected by multiple characteristics (+ or -)
- Autonomous Operation (recover without outside intervention)
  Redundancy: yes
  Regeneration: yes
- Everyday example
  Redundancy: spare tire
  Regeneration: can of fix-a-flat
Increased availability without pre-configured spares …
NASA Moon, Mars, and Beyond: Realize 10’s years service life ???
Stardust: 110 FPGAs …

3 Fault Recovery Characteristics of Selected Approaches
Previous Work on Fault Recovery
Normalized Power Consumption (Energy per Operation):
- n-plex solution using n redundant devices
- Reconfiguration cost r
- Gate-Level redundancy g
- Updated with scan rate s on c CLBs

4 Exploiting Population Information
Population contains more robust information than individuals
- Utilize this information for robust fault detection, faster regeneration, increased diversity for adaptation
Detect Failure and Isolate Faulty Resources
- Detect by inconsistencies among the population
- Isolate faults using outlier identification and aging
Realize Regeneration
- Recovery Complexity << Design Complexity: utilize diverse raw material during regeneration vs. isolated re-design
- Temporal consensus directs search
Adaptable Performance based on Online Inputs
- The population evolves to changing physical environment, input vectors, and target application while increasing availability

5 Procedural Flow under Consensus-Based Evaluation
Initialization
- Partition P into sub-populations of size |P|/2 to designate physical FPGA left-half or right-half resource utilization
Consensus-Based Evaluation
- Discrepancy Operator applied to the pair C_L, C_R
- Four Fitness States: Pristine, Suspect, Under Repair, Refurbished
Regeneration
- Genetic Operators recover based on Reintroduction Rate
- Operators only applied once, then offspring returned to “service” without concern about increasing fitness

6 Consensus-Based Evaluation (CBE) Overview
Uses a Relative Fitness Measure
- Pairwise discrepancy checking yields relative fitness measure
- Broad temporal consensus in the population used to determine fitness metric
- Transition between Fitness States occurs in the population
- Provides graceful degradation in presence of changing environments, applications and inputs, since this is a moving measure
Test Inputs = Normal Inputs for Data Throughput
- CBE utilizes neither additional functional test vectors nor additional resource test vectors
- Potential for higher availability as regeneration is integrated with normal operation

7 State Transitions during Lifetime of i-th Half-Configuration
Configuration Health States
Discrepancy Operator
- Baseline Discrepancy Operator is a dyadic operator with binary output
- Z(C_i) is the FPGA data throughput output of configuration C_i
- Operator = RS: (Hamming Distance)
- Operator = WTA: (Equivalence)
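A minimal sketch of what a dyadic discrepancy operator with binary output could look like, assuming each half-configuration's output Z(C_i) is available as an unsigned integer word. The function names and the `tolerance` parameter are hypothetical; the slide names only the Hamming Distance and Equivalence variants and does not give their formulas.

```python
def hamming_distance(z_left: int, z_right: int) -> int:
    """Count the bit positions in which the two output words differ."""
    return bin(z_left ^ z_right).count("1")


def discrepancy_equivalence(z_left: int, z_right: int) -> int:
    """Binary output: 1 if the two outputs differ at all, 0 if they agree."""
    return int(z_left != z_right)


def discrepancy_hamming(z_left: int, z_right: int, tolerance: int = 0) -> int:
    """Binary output: 1 if the Hamming distance exceeds `tolerance`.

    `tolerance` is an illustrative assumption, not a parameter from the slides.
    """
    return int(hamming_distance(z_left, z_right) > tolerance)
```

With `tolerance=0` the two variants agree; a larger tolerance would make the Hamming-based check ignore small output differences.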

8 Selection and Repair Process
Maintain Availability
- Choose Pristine, Suspect, Refurbished individuals in that order
Enable Regeneration
- Choose Under-Repair individuals subject to Re-introduction rate (R)
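The selection policy above might be sketched as follows for one sub-population. This is an illustration under assumptions: the data layout (a dict from individual name to state string), the function name `select_individual`, and the choice to test the re-introduction rate before the availability ordering are not taken from the slides.

```python
import random

# Preference order stated on the slide for maintaining availability.
AVAILABILITY_ORDER = ["Pristine", "Suspect", "Refurbished"]


def select_individual(states, reintroduction_rate, rng):
    """Pick one individual name from `states` (name -> fitness state).

    With probability `reintroduction_rate`, an Under Repair individual is
    chosen so that it can be evaluated; otherwise the healthiest available
    state is preferred. Returns None for an empty population.
    """
    under_repair = [n for n, s in states.items() if s == "Under Repair"]
    if under_repair and rng.random() < reintroduction_rate:
        return rng.choice(under_repair)
    for state in AVAILABILITY_ORDER:
        candidates = [n for n, s in states.items() if s == state]
        if candidates:
            return rng.choice(candidates)
    # Nothing healthy is left: fall back to an Under Repair individual.
    return rng.choice(under_repair) if under_repair else None
```

For example, `select_individual({"L0": "Suspect", "L1": "Pristine"}, 0.1, random.Random(0))` returns `"L1"`, because no individual is Under Repair and Pristine is preferred over Suspect.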

9 Fitness State Adjustment / Repair

10 Individual’s Fitness: Evaluation Window
Number of Selections with Replacement
Probability of Selection Containing all K items
- Each individual is subjected to sufficient random operational inputs for accurate assessment
- For combinational logic, E_W is determined on the basis of input word width
- Genetic operators are invoked once every E_W iterations on Under-Repair individuals to avoid unnecessary modifications
- E_W = 600 random run-time inputs provide a 99.5% certainty of the test being exhaustive and conclusive
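The 99.5% figure can be checked with the standard inclusion-exclusion formula for the probability that n uniform selections with replacement contain all K items. The sketch below assumes K = 64, that is, a 6-bit input word for the 3-bit x 3-bit multiplier; the slide itself does not state K, and the function name is hypothetical.

```python
from math import comb


def prob_all_items_seen(k: int, n: int) -> float:
    """Probability that n uniform draws with replacement cover all k items.

    Inclusion-exclusion over the number j of items that are never drawn.
    """
    return sum((-1) ** j * comb(k, j) * (1 - j / k) ** n for j in range(k + 1))


# Assumption: K = 64 distinct input words, E_W = 600 random inputs.
coverage = prob_all_items_seen(64, 600)
print(f"{coverage:.4f}")
```

Under this assumption the result is about 0.995, which is consistent with the certainty quoted for E_W = 600.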

11 Population Comparison: Fitness Indices
Population Consensus: Sliding Window
- Population behavior is periodically sampled to determine current oracle value for global fitness metric
- Thresholds need to be current but not updated more frequently than necessary
- Updating thresholds occurs after 25% of individuals completed E_W
- Ensures a fast-moving relative measure for adaptability
- Case study: |C| = 20 individuals … |C_L| = |C_R| = |C|/2
- Sliding Window = 5 E_W; 5/20 = 25% of individuals evaluated == “sufficient”

12 Integer Multiplier Case Study
Automated Creation of a Population of Multipliers:
- Building blocks
  - Half-Adder: 18 templates created
  - Full-Adder: 24 templates
  - Parallel-And: 1 template created
- OR, AND, XOR, NOR, NAND and NOT functions can be assigned to a LUT
- Randomly select templates for instantiation in modules
- Strict Feed-Forward flow enforced
- XOR function excluded from initial designs to increase design space
- Average of 21 CLBs utilized for a 3-bit x 3-bit Multiplier
- Configurations divided into two groups, each subset using exclusive resources

13 GA Parameters & Experiments
Speciation
- Two-point crossover between individuals from same sub-group
- Crossover points chosen to prevent intra-CLB crossover
- Breeding occurs exclusively among members of sub-populations
- Maintains non-interfering resource use among L, R
GA operators
- External-Module-Crossover
- Internal-Module-Crossover
- Internal-Module-Mutation
GA parameters
- Population size: 20 individuals
- Crossover rate: 5%
- Mutation rate: up to 80% per bit
Experiments …
- Fault Isolation Characteristics
- Regenerative Experiments
Demonstrate …
- Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
- Elimination of additional test vectors
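One way to keep two-point crossover from cutting inside a CLB is to treat each CLB's configuration as an indivisible list element and pick cut points only between elements. The sketch below assumes that representation; the function name and the per-CLB record format are hypothetical, and both parents are taken to come from the same sub-population (L or R), as the speciation rule requires.

```python
import random


def two_point_clb_crossover(parent_a, parent_b, rng):
    """Two-point crossover with cut points restricted to CLB boundaries.

    Each parent is a list with one entry per CLB, so slicing the list can
    never split the configuration bits of a single CLB.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must cover the same number of CLBs")
    n = len(parent_a)
    # Two distinct cut positions in 0..n, ordered so that i < j.
    i, j = sorted(rng.sample(range(n + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b
```

Because the children inherit each CLB position from a parent in the same sub-population, they stay within that half's resources, which matches the non-interfering resource use noted above.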

14 Isolation of a single faulty individual with 1-out-of-64 impact
- Outliers are identified after E_W iterations have elapsed
- Expected DV = (1/64)*600 = 9.375 from individual impacted by fault
- Isolated faulty individual’s DV differs from the average DV by 3σ after 1 or more observation intervals of length E_W
Figure: instantaneous DV (point values) for a sample individual in population and population oracles (solid lines); Sliding Window
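The expected discrepancy value and the 3σ outlier test can be illustrated with a short sketch. It assumes the population statistics are computed over all individuals, including the faulty one, and uses the population standard deviation; the slides do not specify either choice, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def expected_dv(fault_impact: float, eval_window: int) -> float:
    """Expected discrepancy count for a fault hit by `fault_impact` of inputs."""
    return fault_impact * eval_window


def find_outliers(dvs, k=3.0):
    """Return names whose DV lies more than k standard deviations from the mean.

    `dvs` maps individual name -> discrepancy value over one window.
    """
    values = list(dvs.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every individual agrees, so nothing stands out
    return [name for name, dv in dvs.items() if abs(dv - mu) > k * sigma]


# 20 individuals; one accumulates roughly the expected 9 discrepancies.
population = {f"C{i}": 0 for i in range(20)}
population["C7"] = 9
print(expected_dv(1 / 64, 600), find_outliers(population))
```

Here `expected_dv(1 / 64, 600)` gives 9.375, as on the slide, and only `C7` is flagged because its deviation from the mean exceeds three standard deviations.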

15 Isolation of a single faulty L individual with 10-out-of-64 impact
Compare with 1-out-of-64 fault impact
- Expected DV of (10/64)*600 = 93.75 for faulty configuration
- One isolation will be complete approx. once in every 93.75/5 = 19 Sliding Windows
- Fault Isolation achieved is 100%

16 Isolation of 8 faulty individuals L4&R4 with 1-out-of-64 impact
Expected isolations do not occur approx. 40% of the time
- Average discrepancy value of the population is higher
- Outlier isolation difficult
- Multiple faulty individuals, discrepancies scattered

17 Regeneration Performance
Difference (vs. Hamming Distance)
Parameters:
- Evaluation Window, E_W = 600
- Suspect Threshold: DV_S = 1-6/600 = 99%
- Repair Threshold: DV_R = 1-4/600 = 99.3%
- Re-introduction rate: r = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.

18 Conclusion
Repair Complexity
- Should be more tractable than Design Complexity, given diverse “spare” designs
Population-Centric Assessment
- Provides adaptability and self-calibrating autonomy with a relative assessment method
Run-time Fault Management
- Can be realized using consensus-driven assessment methods, and using information contained in the population
- Integrate Detection, Isolation, Repair under a single Population-based technique

