Quantitative Analysis of Fault-Tolerant RapidIO-based Network Architectures. David Bueno, April 27, 2006, HCS Research Laboratory, ECE Department.


1. Quantitative Analysis of Fault-Tolerant RapidIO-based Network Architectures
David Bueno
April 27, 2006
HCS Research Laboratory, ECE Department
University of Florida

2. Motivation
- Gain further insight into the strengths and weaknesses of the proposed architectures
- Quantify power, size/cost, fault isolation, and fault tolerance while maintaining a fixed level of performance
- Provide flexible, weighted evaluation criteria that other users may modify to fit their needs
  - Avoid excessive complexity by using fair heuristics to estimate power, size/cost, fault isolation, and fault tolerance

3. Evaluation Criteria Overview
- Power
  - Very important in nearly every embedded system
  - Evaluated by the number of active ports under no-fault conditions
  - Conservatively assume that a multiplexer port uses 50% of the power of a full RapidIO switch port (much less logic is needed; the device only has to multiplex and repeat the LVDS signal)
- Size/cost
  - Taken to be determined by the total number of network pins in all chips in the network fabric
  - The fairest way to treat serial versus parallel RIO pin-count considerations
  - Means multiplexer chips are costly because of their high pin count
- Fault isolation
  - Measure of how much a fault affects other components in the system
  - The classic approach of fully redundant networks provides near-perfect fault isolation
  - Measured as the average number of switches that must be rerouted in the event of a switch fault, assuming the fault may occur in any active switch with equal likelihood
  - Ideally, switches should be unaware of and unaffected by faults in the system
- Fault tolerance
  - Most important metric for this work
  - Calculate the expected number of switches that may fail in a given system before a performance loss greater than 5% occurs in the corner turn application
  - Corner turn was selected for its high level of network stress and its relevance to real-world signal processing applications
  - Failure of multiplexer devices is not explicitly considered analytically, but must be discussed
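The fault isolation metric can be illustrated with a short Python sketch. It averages the per-level reroute counts listed in the architecture summary table over all active switches. The split of the 12 active switches into 8 first-level and 4 second-level switches is an assumption inferred from the 32-endpoint Clos layout, and the function name is hypothetical.

```python
def avg_rerouted_switches(first_level, second_level, reroute_first, reroute_second):
    """Average number of switches rerouted per fault, assuming every
    active switch is equally likely to be the one that fails."""
    total = first_level + second_level
    return (first_level * reroute_first + second_level * reroute_second) / total

# Assumption: 8 active first-level switches and 4 active second-level switches.
FIRST, SECOND = 8, 4

# (switches rerouted on a 1st-level fault, switches rerouted on a 2nd-level fault)
reroute_counts = {
    "Baseline Clos": (0, 0),
    "Redundant First Stage": (0, 8),
    "Redundant First Stage (Serial RIO)": (4, 8),
    "Fault-Tolerant Clos": (5, 8),
}

fault_isolation = {
    name: avg_rerouted_switches(FIRST, SECOND, r1, r2)
    for name, (r1, r2) in reroute_counts.items()
}

for name, value in fault_isolation.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the averages agree with the raw fault isolation values reported in the results table (0, 2.67, 5.33, and 6).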

4. FT Calculation
- Calculation of most entries is trivial (e.g. number of network pins)
- The FT calculation is slightly more complex and is explained here for completeness
- F = expected number of switch failures tolerated before a loss of connectivity to any endpoint or a 5% drop in performance of our corner turn application
- S_i = probability that a system failure has occurred with any number of faults up to and including i, where:
  - N = number of switches in the system
  - P_i = probability of a system failure after exactly i faults
- The equation for F is derived from the classical definition of an expected value
  - The probability of system failure with a given number of faults is equal to the probability of system failure with exactly that number of faults (P_i), multiplied by the probability that the system has not previously failed with any smaller number of faults (1 - S_(i-1))
- Since lower scores are better in our evaluation, the reciprocal of the expected number of faults is taken prior to normalization (the reciprocal is not shown in Table 8)
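The equations for F and S_i are not reproduced in this transcript. One reading consistent with the definitions above is S_i = S_(i-1) + P_i (1 - S_(i-1)) with S_0 = 0, and F = sum over i = 1..N of i * P_i * (1 - S_(i-1)). The Python sketch below assumes that form; the function name and the example probabilities are illustrative and are not taken from the slides.

```python
def expected_faults_to_failure(p):
    """Expected fault count at which the system fails.

    p[i-1] holds P_i, the probability that the system fails at exactly
    i faults given that it survived all smaller fault counts.
    Assumed form: F = sum_i i * P_i * (1 - S_(i-1)).
    """
    s_prev = 0.0    # S_0: no failure is possible with zero faults
    expected = 0.0
    for i, p_i in enumerate(p, start=1):
        fail_here = p_i * (1.0 - s_prev)  # fails at exactly the i-th fault
        expected += i * fail_here
        s_prev += fail_here               # S_i = S_(i-1) + P_i * (1 - S_(i-1))
    return expected

# Hypothetical system that always survives one fault and always fails on the second.
print(expected_faults_to_failure([0.0, 1.0]))

# Hypothetical system with a 50% chance of failing on the first fault.
print(expected_faults_to_failure([0.5, 1.0]))
```

With this indexing, a system that always survives one fault and always fails on the second gives F = 2, in line with the baseline's raw fault tolerance value of 2.0 in the results table. The last P_i should be 1 so that the failure probabilities sum to one.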

5. Weights and Scoring System
- Weights
  - Power and size/cost are very important to a space system and are each weighted at 1.0
  - FT is the primary focus of this work and also encompasses performance for our purposes, so it is weighted at 2.0
  - Fault isolation is weighted at 0.5, since it is based on a simple metric (rerouted switches) that was only a small focus of our investigation
- Prior to weighting, scores for each system are normalized, with the best system having a score of 1.0 (lower scores are better)
  - Fault isolation is a special case, since the fully redundant baseline has "perfect" fault isolation with 0 switches rerouted in the event of a single fault
  - Data are therefore normalized to the next best system, and the baseline is given a score of 0
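The scoring procedure can be sketched in Python as follows. The raw values are copied from the quantitative results table; the function and variable names are hypothetical, and the handling of a raw value of 0 follows the fault isolation special case described above.

```python
WEIGHTS = {"power": 1.0, "size": 1.0, "isolation": 0.5, "tolerance": 2.0}

# Raw metric values per architecture, copied from the results table.
RAW = {
    "Baseline Clos": {"power": 96, "size": 7680, "isolation": 0.0, "tolerance": 2.0},
    "Redundant First Stage": {"power": 128, "size": 10240, "isolation": 2.67, "tolerance": 2.37},
    "Redundant First Stage + Extra-switch Core": {"power": 128, "size": 12160, "isolation": 2.67, "tolerance": 3.95},
    "Redundant First Stage (Serial RIO)": {"power": 96, "size": 3072, "isolation": 5.33, "tolerance": 2.37},
    "Redundant First Stage + Extra-switch Core (Serial RIO)": {"power": 96, "size": 3584, "isolation": 5.33, "tolerance": 3.95},
    "Fault-Tolerant Clos": {"power": 96, "size": 7200, "isolation": 6.0, "tolerance": 2.89},
}

def normalize(values):
    """Scale so the best (smallest) nonzero value scores 1.0; a raw 0 stays 0."""
    best = min(v for v in values.values() if v > 0)
    return {name: v / best for name, v in values.items()}

def total_scores(raw, weights):
    """Weighted sum of normalized scores; lower totals are better."""
    totals = {name: 0.0 for name in raw}
    for metric, weight in weights.items():
        column = {name: row[metric] for name, row in raw.items()}
        if metric == "tolerance":
            # More tolerated faults is better, so score the reciprocal.
            column = {name: 1.0 / v for name, v in column.items()}
        for name, score in normalize(column).items():
            totals[name] += weight * score
    return totals

totals = total_scores(RAW, WEIGHTS)
for name, score in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{score:5.2f}  {name}")
```

Because the published raw values are themselves rounded, the totals computed this way can differ from the table's totals by about 0.01.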

6. Quantitative Results and Analysis
- Lower normalized scores are better
- Total score is the sum of the normalized scores after weighting
- Most architectures had similar power consumption, with mux-based architectures at a slight disadvantage due to extra powered devices
- Large differences in size/cost due to widely varied ways of providing FT
  - Serial RIO architectures have the edge due to low pin count and lack of muxes
  - FTC provides a promising compromise between the other alternatives due to its number of muxes
- Fault isolation metric of the serial and FTC solutions suffers due to the additional switch reconfigurations needed (rather than mux reconfigurations)
  - Muxes in other architectures may provide additional fault isolation and are trivial to reconfigure
- All architectures provide better FT than the baseline
  - Extra-switch core networks with a redundant first stage may withstand nearly 4 faults
  - Addition of 1 core switch actually increases expected FT by more than 1 switch
- Overall, serial RIO-based architectures scored the best (lowest), with the FTC network providing an interesting compromise for parallel solutions in terms of all factors except fault isolation

Weights: power 1.0, size/cost 1.0, fault isolation 0.5, fault tolerance 2.0.

| Architecture | Power (active ports), raw | Power, norm. | Size/cost (total network pins), raw | Size/cost, norm. | Fault isolation (avg. rerouted switches), raw | Fault isolation, norm. | Fault tolerance (number of switch faults), raw | Fault tolerance, norm. | Total score |
|---|---|---|---|---|---|---|---|---|---|
| Baseline Clos Network | 96 | 1.0 | 7680 | 2.5 | 0 | 0 | 2.0 | 1.98 | 7.45 |
| Redundant First Stage Network | 128 | 1.33 | 10240 | 3.33 | 2.67 | 1.0 | 2.37 | 1.67 | 8.51 |
| Redundant First Stage Network with Extra-switch Core | 128 | 1.33 | 12160 | 3.96 | 2.67 | 1.0 | 3.95 | 1.0 | 7.79 |
| Redundant First Stage Network (Serial RIO) | 96 | 1.0 | 3072 | 1.0 | 5.33 | 2.0 | 2.37 | 1.67 | 6.34 |
| Redundant First Stage Network with Extra-switch Core (Serial RIO) | 96 | 1.0 | 3584 | 1.17 | 5.33 | 2.0 | 3.95 | 1.0 | 5.17 |
| RapidIO Fault-Tolerant Clos Network | 96 | 1.0 | 7200 | 2.34 | 6 | 2.25 | 2.89 | 1.37 | 7.20 |

7. Supplementary Information

8. Summary of Basic Architectural Characteristics

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Number Switches to Reroute (1st-level fault) | Number Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Baseline Clos Network | 12 | 12 | 24 | 8 | 192 | 0 | 0 | 0 |
| Redundant First Stage Network | 12 | 8 | 20 | 8 | 160 | 8 (8:4) | 0 | 8 |
| Redundant First Stage Network with Extra-switch Core | 12 | 9 | 21 | 8 | 184 | 8 (10:5) | 0 | 8 |
| Redundant First Stage Network (Serial RIO) | 12 | 8 | 20 | 8 | 192 | 0 | 4 | 8 |
| Redundant First Stage Network with Extra-switch Core (Serial RIO) | 12 | 9 | 21 | 8 | 224 | 0 | 4 | 8 |
| RapidIO Fault-Tolerant Clos Network | 12 | 3 | 15 | 8 | 140 | 8 (4:1) | 5 | 8 |
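As a cross-check, the size/cost figures in the results table can be related to the port counts in this summary. The Python sketch below rests on assumptions that the slides do not state: 40 pins per 8-bit parallel RapidIO port, 16 pins per 4-lane serial port, and an n:m multiplexer counted as n + m ports with the same pin count as a switch port. The function name is hypothetical.

```python
PINS_PARALLEL = 40  # assumed pins per 8-bit parallel RapidIO port
PINS_SERIAL = 16    # assumed pins per 4-lane serial RapidIO port

def network_pins(switch_ports, pins_per_port, mux_count=0, ports_per_mux=0):
    """Total network pins across all switch and multiplexer ports."""
    return (switch_ports + mux_count * ports_per_mux) * pins_per_port

pins = {
    "Baseline Clos": network_pins(192, PINS_PARALLEL),
    "Redundant First Stage": network_pins(160, PINS_PARALLEL, 8, 8 + 4),
    "Redundant First Stage + Extra-switch Core": network_pins(184, PINS_PARALLEL, 8, 10 + 5),
    "Redundant First Stage (Serial RIO)": network_pins(192, PINS_SERIAL),
    "Redundant First Stage + Extra-switch Core (Serial RIO)": network_pins(224, PINS_SERIAL),
    "Fault-Tolerant Clos": network_pins(140, PINS_PARALLEL, 8, 4 + 1),
}

for name, count in pins.items():
    print(f"{name}: {count}")
```

Under these assumptions the computed pin totals agree with the raw size/cost column of the results table, which suggests the multiplexers account for a large share of the pins in the parallel mux-based designs.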

9. Baseline Clos Network
- Non-blocking architecture supporting 32 RapidIO endpoints
- FT accomplished by completely duplicating the network (redundant network not shown)
- Withstands 1 switch fault while maintaining full connectivity

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Number Switches to Reroute (1st-level fault) | Number Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 12 | 12 | 24 | 8 | 192 | 0 | 0 | 0 |

10. Redundant First Stage Network
- Similar to the baseline, but the first level has switch-by-switch failover using components that multiplex 8 RapidIO links down to 4
  - Must consider muxes as a potential point of failure
- Second-level FT handled by redundant-paths routing
  - Full connectivity maintained as long as 1 of 4 switches remains functional
  - Could also be supplemented with a redundant second level using switch-by-switch failover, at the cost of more complex multiplexing circuitry
- Muxes may present a single point of failure, so processor-level redundancy may be needed

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Number Switches to Reroute (1st-level fault) | Number Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Redundant First Stage | 12 | 8 | 20 | 8 | 160 | 8 (8:4) | 0 | 8 |

11. Redundant First Stage Network: Extra-Switch Core
- Adds an additional core switch to the redundant first stage network
  - Switch may be left inactive and used in the event of a fault
- Second-level FT handled by redundant-paths routing
  - Requires switches with at least 9 ports in the first level and 8 ports in the second level
  - Multiplexers must be 10:5 rather than 8:4

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Number Switches to Reroute (1st-level fault) | Number Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Redundant First Stage: Extra-Switch Core | 12 | 9 | 21 | 8 | 184 | 8 (10:5) | 0 | 8 |

12. Redundant First Stage Network: No Muxes
- Muxes add complexity and may be a point of failure
  - May be challenging to build LVDS mux components
- Design requires 16-port switches in the backplane, but only 8 active ports per switch are needed
  - High port-count switches will be enabled through space-qualified serial RapidIO
  - For future serial RIO, assume Honeywell HX5000 SerDes with 3.125 GHz x 4 lanes (possible per Honeywell High-Speed Data Networking Tech. data sheet, June '05)
  - Roughly equivalent to 16-bit, 312.5 MHz DDR parallel RIO
  - For this research, parallel RIO clock rates are used for a fair comparison

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Number Switches to Reroute (1st-level fault) | Number Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Redundant First Stage: No Muxes | 12 | 8 | 20 | 8 | 192 | 0 | 4 | 8 |
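The rough equivalence between the 4-lane serial link and 16-bit DDR parallel RIO can be checked with simple arithmetic. The Python sketch below assumes 8b/10b line coding on the serial lanes, which the slide does not mention; the variable names are illustrative.

```python
# Serial RapidIO: 4 lanes at 3.125 Gbaud each.
# Assumption: 8b/10b line coding, so 8 data bits per 10 line bits.
serial_lanes = 4
serial_gbaud_per_lane = 3.125
serial_gbps = serial_lanes * serial_gbaud_per_lane * 8 / 10

# Parallel RapidIO: 16 data bits, 312.5 MHz clock, double data rate.
parallel_width_bits = 16
parallel_clock_ghz = 0.3125
parallel_gbps = parallel_width_bits * parallel_clock_ghz * 2

print(f"serial:   {serial_gbps:.1f} Gb/s")
print(f"parallel: {parallel_gbps:.1f} Gb/s")
```

Both come to about 10 Gb/s of data bandwidth per direction under these assumptions, before any packet or protocol overhead.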

13. Redundant First Stage Network: No Muxes + Extra-Switch Core
- Combines the methodologies of the previous two architectures
- Requires 9-port switches in the first level and 16-port switches in the second level
  - Realistically attainable using serial RIO
- Availability of a 32-port serial switch would greatly simplify the design (1-switch network!)
  - Preferred FT approach would tend towards the "redundant network" approach for fabrics of these sizes

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Number Switches to Reroute (1st-level fault) | Number Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Redundant First Stage: No Muxes + Extra-Switch Core | 12 | 9 | 21 | 8 | 224 | 0 | 4 | 8 |

14. Fault-Tolerant Clos Network
- Architecture studied at NJIT in the 1990s, adapted here for RapidIO
- Uses multiplexers (4:1) for more efficient redundancy in the first level
  - Only requires 1 redundant switch for every 4 switches in the first stage
  - Multiplexer components are no longer a potential single point of failure for the connectivity of any processor
- Has an additional switch in the second level, similar to the other architectures shown
- Requires 9-port switches in the first level and 10-port switches in the second level
  - 24-endpoint version possible using only 8-port switches and 3:1 muxes
- Can withstand 1 first-level fault on either half of the network with no loss in functionality or performance
  - Compromise on fully redundant first-stage approaches in terms of FT and size/weight/cost

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Number Switches to Reroute (1st-level fault) | Number Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Fault-Tolerant Clos Network | 12 | 3 | 15 | 8 | 140 | 8 (4:1) | 5 | 8 |

