04/27/06 Quantitative Analysis of Fault-Tolerant RapidIO-based Network Architectures
David Bueno
April 27, 2006
HCS Research Laboratory, ECE Department, University of Florida

Motivation
- Gain further insight into the strengths and weaknesses of the proposed architectures
- Quantify power, size/cost, fault isolation, and fault tolerance while maintaining a fixed level of performance
- Provide flexible, weighted evaluation criteria that other users may modify to fit their needs
- Avoid excessive complexity by using fair heuristics to estimate power, size/cost, fault isolation, and fault tolerance

Evaluation Criteria Overview
Power
- Very important in nearly every embedded system
- Power is evaluated from the number of active ports under no-fault conditions
- Conservatively assume multiplexer ports use 50% of the power of a full RapidIO switch port (much less logic is needed, since a mux port only multiplexes and repeats the LVDS signal)
Size/cost
- Size/cost is taken to be determined by the total number of network pins across all chips in the network fabric
- The fairest way to treat serial/parallel RIO pin-count considerations
- This means multiplexer chips are costly due to their high pin count
Fault isolation
- A measure of how much a fault affects other components in the system
- The classic approach of fully redundant networks provides near-perfect fault isolation
- Fault isolation is measured as the average number of switches that must be rerouted in the event of a switch fault, assuming a fault may occur in any active switch with equal likelihood
- Ideally, switches should be unaware of and unaffected by faults in the system
Fault tolerance
- The most important metric for this work
- We calculate the expected number of switches that may fail in a given system before a performance loss greater than 5% occurs in the corner-turn application
- The corner turn was selected for its high level of network stress and its relevance to real-world signal processing applications
- Failure of multiplexer devices is not explicitly considered analytically, but must be discussed
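The power heuristic above reduces to a weighted port count. A minimal sketch follows; the function name and the example port counts are illustrative assumptions, with mux ports weighted at 50% of a full switch port per the assumption on this slide:

```python
def power_score(active_switch_ports, active_mux_ports):
    """Estimate relative power from active ports under no-fault conditions.

    A multiplexer port is assumed to draw 50% of the power of a full
    RapidIO switch port, since it only multiplexes and repeats the LVDS
    signal rather than implementing full switch-port logic.
    """
    return active_switch_ports + 0.5 * active_mux_ports

# Hypothetical example: 48 active switch ports plus 16 active mux ports
# count the same as 56 full switch ports.
print(power_score(48, 16))
```

The same active-port counting also drives the size/cost comparison, except that size/cost counts total pins at full weight, which is why mux chips help power but hurt size/cost.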

FT Calculation
- Calculation of most entries is trivial (e.g., the number of network pins); the FT calculation is slightly more complex and is explained here for completeness
- F = expected number of switch failures tolerated before a loss of connectivity to any endpoint or a 5% drop in performance of our corner-turn application
- S_n = probability that a system failure has occurred with any number of faults up to and including n:
    S_n = sum for i = 1 to n of P_i * (1 - S_(i-1))
- The equation for F is derived from the classical definition of an expected value:
    F = sum for i = 1 to N of i * P_i * (1 - S_(i-1))
- where:
  - N = number of switches in the system
  - P_i = probability of a system failure after exactly i faults
- The probability of system failure with a given number of faults equals the probability of system failure with exactly that number of faults (P_i), multiplied by the probability that the system has not previously failed with any smaller number of faults (1 - S_(i-1))
- Since lower scores are better in our evaluation, the reciprocal of the expected number of faults is taken prior to normalization (the reciprocal is not shown in Table 8)
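The expected-value computation described on this slide can be sketched numerically. The function name and the sample probabilities below are illustrative assumptions; the per-fault-count failure probabilities P_i are taken as given:

```python
def expected_faults_at_failure(P):
    """Expected number of faults at which the system fails.

    P[i-1] is the probability of a system failure with exactly i faults.
    S_prev accumulates the probability that the system has already failed
    with a smaller number of faults: S_i = S_(i-1) + P_i * (1 - S_(i-1)).
    """
    S_prev = 0.0  # S_0: the system cannot have failed with zero faults
    F = 0.0
    for i, Pi in enumerate(P, start=1):
        q = Pi * (1.0 - S_prev)  # failure occurs at exactly i faults
        F += i * q
        S_prev += q
    return F

# If failure is certain at the very first fault, the expectation is 1 fault.
print(expected_faults_at_failure([1.0]))  # 1.0
```

As a sanity check, if the system never fails with 1 fault but always fails with 2, the expectation is exactly 2 faults.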

Weights and Scoring System
Weights
- Power and size/cost are very important to a space system and are each weighted at 1.0
- FT is the primary focus of this work and also encompasses performance for our purposes; it is weighted at 2.0
- Fault isolation is weighted at 0.5, since it is based on a simple metric (rerouted switches) that was only a small focus of our investigation
Scoring
- Prior to weighting, scores for each system are normalized, with the best system having a score of 1.0 (lower scores are better)
- Fault isolation is a special case, since the fully redundant baseline has "perfect" fault isolation with 0 switches rerouted in the event of a single fault
  - We therefore normalize the data to the next-best system and give the baseline a score of 0
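The normalization and weighting scheme described above can be sketched as follows. The architecture names and raw values are placeholders, not the study's actual data; a raw value of 0 (the baseline's perfect fault isolation) keeps a normalized score of 0, while the remaining systems are normalized to the next-best value:

```python
def total_scores(raw, weights):
    """Combine raw metric values (lower is better) into weighted totals.

    raw:     {metric: {architecture: raw_value}}
    weights: {metric: weight}
    Each metric is normalized so the best (smallest nonzero) system scores
    1.0 before weighting; a raw value of 0 stays 0, which handles the
    fault-isolation special case for the fully redundant baseline.
    """
    totals = {}
    for metric, weight in weights.items():
        best = min(v for v in raw[metric].values() if v > 0)
        for arch, value in raw[metric].items():
            totals[arch] = totals.get(arch, 0.0) + weight * (value / best)
    return totals

# Placeholder data: two architectures, two metrics.
raw = {
    "power": {"baseline": 64, "serial": 32},
    "fault_isolation": {"baseline": 0, "serial": 2},
}
weights = {"power": 1.0, "fault_isolation": 0.5}
print(total_scores(raw, weights))  # {'baseline': 2.0, 'serial': 1.5}
```

The total score for each architecture is then simply the sum of its weighted, normalized metric scores, as used on the results slide.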

Quantitative Results and Analysis
- Lower normalized scores are better; the total score is the sum of the normalized scores after weighting
- Most architectures had similar power consumption, with the mux-based architectures at a slight disadvantage due to extra powered devices
- Large differences in size/cost stem from the widely varied ways of providing FT
  - Serial RIO architectures have the edge due to low pin count and the lack of muxes
  - FTC provides a promising compromise between the other alternatives due to its number of muxes
- The fault isolation metric of the serial and FTC solutions suffers because additional switch reconfigurations are needed (rather than mux reconfigurations)
  - Muxes in the other architectures may provide additional fault isolation and are trivial to reconfigure
- All architectures provide better FT than the baseline
  - Extra-switch core networks with a redundant first stage may withstand nearly 4 faults
  - The addition of 1 core switch actually increases the expected FT by more than 1 switch
- Overall, the serial RIO-based architectures scored the best (lowest), with the FTC network providing an interesting compromise for parallel solutions in terms of all factors except fault isolation

[Results table: raw and normalized scores, with weights, for Power (active ports), Size/Cost (total network pins), Fault Isolation (avg. rerouted switches), Fault Tolerance (number of switch faults), and Total Score, for the Baseline Clos Network, the Redundant First Stage Network with and without Extra-switch Core (parallel and serial RIO variants), and the RapidIO Fault-Tolerant Clos Network; numeric entries not preserved]

Supplementary Information

Summary of Basic Architectural Characteristics

[Table: active switches, standby switches, total switches, active ports per switch, total switch ports, mux count, and number of switches to reroute for first- and second-level faults, for each architecture; mux types are 8:4 (Redundant First Stage Network), 10:5 (Extra-switch Core variant), and 4:1 (RapidIO Fault-Tolerant Clos Network); numeric entries not preserved]

Baseline Clos Network
- Non-blocking architecture supporting 32 RapidIO endpoints
- FT accomplished by completely duplicating the network (redundant network not shown)
- Withstands 1 switch fault while maintaining full connectivity

Redundant First Stage Network
- Similar to the baseline, but the first level has switch-by-switch failover using components that multiplex 8 RapidIO links down to 4
  - Must consider the muxes as a potential point of failure
- Second-level FT handled by redundant-paths routing
  - Full connectivity is maintained as long as 1 of the 4 switches remains functional
  - Could also be supplemented with a redundant second level using switch-by-switch failover, at the cost of more complex multiplexing circuitry
- Muxes may present a single point of failure, so processor-level redundancy may be needed

Redundant First Stage Network: Extra-Switch Core
- Adds an additional core switch to the redundant first stage network
  - The switch may be left inactive and used in the event of a fault
- Second-level FT handled by redundant-paths routing
- Requires switches with at least 9 ports in the first level and 8 ports in the second level
- Multiplexers must be 10:5 rather than 8:4

Redundant First Stage Network: No Muxes
- Muxes add additional complexity and may be a point of failure
  - It may be challenging to build LVDS mux components
- The design requires 16-port switches in the backplane, but only 8 active ports per switch
- High port-count switches will be enabled through space-qualified serial RapidIO
  - For future serial RIO, assume a Honeywell HX5000 SerDes with GHz x 4 lanes (possible per the Honeywell High-Speed Data Networking Tech. data sheet, June '05)
  - Roughly equivalent to 16-bit, MHz DDR parallel RIO
  - For this research, parallel RIO clock rates are used for a fair comparison

Redundant First Stage Network: No Muxes + Extra-Switch Core
- Combines the methodologies of the previous two architectures
- Requires 9-port switches in the first level and 16-port switches in the second level
  - Realistically attainable using serial RIO
- Availability of a 32-port serial switch would greatly simplify the design (a 1-switch network!)
  - For fabrics of these sizes, the preferred FT approach would tend toward the "redundant network" approach

Fault-Tolerant Clos Network
- Architecture studied at NJIT in the 1990s, adapted here for RapidIO
- Uses multiplexers (4:1) for more efficient redundancy in the first level
  - Only requires 1 redundant switch for every 4 switches in the first stage
  - The multiplexer components are no longer a potential single point of failure for the connectivity of any processor
- Has an additional switch in the second level, similar to the other architectures shown
- Requires 9-port switches in the first level and 10-port switches in the second level
  - A 24-endpoint version is possible using only 8-port switches and 3:1 muxes
- Can withstand 1 first-level fault on either half of the network with no loss in functionality or performance
- A compromise on the fully redundant first-stage approaches in terms of FT and size/weight/cost