12/13/05 1 Progress Report: Fault-Tolerant Architectures and Testbed Development with RapidIO
Sponsor: Honeywell Space, Clearwater, FL
Principal Investigator: Dr. Alan D. George
Graduate Research Assistants: David Bueno, Chris Conger
Other Support: Ian Troxel
HCS Research Laboratory, ECE Department, University of Florida

12/13/05 2 Presentation Outline
- RapidIO Experimental Testbed
  - RapidIO Testbed Research Overview
  - Processing Node Architecture
  - Application Details
    - Constant False-Alarm Rate (CFAR)
    - Pulse Compression
  - Definition of Experiments
  - Experimental Results
  - Analysis
  - Conclusions
  - Future Work
- Fault-Tolerant RapidIO Architectures
  - Fault-Tolerant RapidIO Research Overview
  - Benchmark Kernel Summary
  - Fault-Tolerant Topologies
  - Experimental Results
  - Analytical Results
  - Conclusions
  - Future Work

12/13/05 3 RapidIO Experimental Testbed Research

12/13/05 4 RapidIO Testbed Research Overview
- Designed and built a new, realistic processing node for testbed
  - Intended to imitate our understanding of Honeywell's processing node architecture
  - New processing node named HCS-CNF, or HCS Lab Compute-Node FPGA
- Highlight critical design features, benchmark performance of memory transfers, and comment on results
  - Explain overheads associated with architecture
  - Indicate potential optimizations/improvements
- Design and analysis of application kernels over RapidIO
  - CFAR
  - Pulse compression
  - Processing performance vs. varied data locality
- Address status of switch integration
- Variety of immediate and long-term future work

12/13/05 5 HCS-CNF Top-level Architecture
[Figure: Data Path Conceptual Diagram]
- Built using Virtex-II Pro FPGAs of testbed
- Each HCS-CNF includes:
  - Processing element* w/ 16 KB internal SRAM
  - External 8 MB SDRAM storage device
  - RapidIO endpoint
- Implements DMA, transparent remote memory access for processing element*
  * Processing element can be PowerPC and/or FPGA fabric
- Arbitrates access to SDRAM storage, no "starvation" allowed
  - Automatically switch between servicing locally and remotely initiated memory transfers
  - Equally prioritized, 256 bytes per burst (if both local and remote transactions are requesting bus)
- Interface 64-bit data path of endpoint with 32-bit data path of SDRAM device
- Maintain minimum theoretical throughput of 4 Gbps at all points
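A rough consistency check on that 4 Gbps figure (inferred from the stated path widths and the 8 ns clock period implied by the latency measurements later in this report, not a number taken from the design itself): a 32-bit SDRAM data path at 125 MHz moves 32 bits × 125 MHz = 4 Gbps, while the 64-bit endpoint path needs only 62.5 MHz to sustain the same rate.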

12/13/05 6 Processing Element (PE) Options
- Processing performed by computation in FPGA logic, or embedded PowerPC
- Interface for both options is identical
  - DMA-style interface, PE provides (source address, destination address, transfer size); see the sketch below
  - Operations carried out transparently, interrupt signal set when transfer completes
  - Only one transfer at a time
- Main processing intended to be performed directly from internal SRAM
  - Data transferred from local SDRAM, or from remote SDRAM through RapidIO, directly into local SRAM for processing
  - Processed data placed back into SRAM (storage requirements?)
  - Finished data can be:
    - Written to local SDRAM to be read by remote endpoint
    - Written directly to remote SDRAM without explicit request by remote endpoint
  - Applications address physical memory locations directly, in absence of OS
[Figure: Memory Map]
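A minimal C sketch of this DMA-style interface as seen from a software PE: the PE supplies (source, destination, size), starts the transfer, and waits for completion. All register names, addresses, and bit positions are hypothetical placeholders for illustration, not the actual HCS-CNF memory map.

```c
#include <stdint.h>

/* Hypothetical memory-mapped DMA engine registers */
#define DMA_SRC   (*(volatile uint32_t *)0xA0000000u)  /* source address      */
#define DMA_DST   (*(volatile uint32_t *)0xA0000004u)  /* destination address */
#define DMA_SIZE  (*(volatile uint32_t *)0xA0000008u)  /* transfer size       */
#define DMA_CTRL  (*(volatile uint32_t *)0xA000000Cu)  /* start bit           */
#define DMA_STAT  (*(volatile uint32_t *)0xA0000010u)  /* done flag           */

/* Only one transfer at a time: start the DMA, then wait on the completion
 * flag (the hardware PE would instead watch the interrupt signal). */
static void dma_transfer(uint32_t src, uint32_t dst, uint32_t nbytes)
{
    DMA_SRC  = src;
    DMA_DST  = dst;
    DMA_SIZE = nbytes;
    DMA_CTRL = 1u;
    while ((DMA_STAT & 1u) == 0u)
        ;
}
```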

12/13/05 7 Memory Sub-System Architecture
[Figure: Signal-Level Block Diagram of Endpoint Controller]

12/13/05 8 Application Details: HW-CFAR
- Internal SRAM partitioned into two sections:
  - Output buffer (8 kB)
  - Input buffer (8 kB)
- State machine (not shown) initiates DMAs, controls processing
  - Transfer data into SRAM (DMA from local or remote storage)
  - Perform detection, stream results into output buffer (MSB set to 1 for target)
  - Transfer data back to local or remote storage
- Each cell is compared to a scaled average of surrounding cells (immediately adjacent cells ignored, called guard cells); see the sketch below
- Parameters selected for this implementation:
  - Window size: 41 elements
  - # guard cells: 8 (4 either side)
  - # averaging elements: 32 (16 either side)
  - Does not perform magnitude computation
[Figures: CFAR Processor Architecture; CFAR Sliding Window and Cell Definition]
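A behavioral C sketch of that detection rule, using the parameters above (one cell under test, 4 guard cells and 16 averaging cells per side, MSB set to flag a target); the threshold scale factor ALPHA and the 32-bit cell format are assumptions for illustration, not parameters taken from the hardware module.

```c
#include <stdint.h>

#define GUARD  4     /* guard cells per side                */
#define AVG    16    /* averaging cells per side            */
#define ALPHA  4     /* hypothetical threshold scale factor */

/* Slide a 41-element window (1 + 2*GUARD + 2*AVG) across the input and
 * set the MSB of any cell that exceeds ALPHA times the local average. */
void cfar_detect(const uint32_t *in, uint32_t *out, int n)
{
    for (int i = GUARD + AVG; i < n - GUARD - AVG; i++) {
        uint64_t sum = 0;
        for (int k = GUARD + 1; k <= GUARD + AVG; k++)
            sum += (uint64_t)in[i - k] + in[i + k];    /* skip guard cells */
        uint64_t avg = sum / (2 * AVG);
        out[i] = ((uint64_t)in[i] > ALPHA * avg)
                     ? (in[i] | 0x80000000u)           /* target detected  */
                     : in[i];
    }
}
```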

12/13/05 9 Application Details: SW-CFAR
- Instead of using a custom logic module to perform CFAR in hardware, implement CFAR processing in software
  - Embedded PowerPC used to initiate DMA transfers, process data out of SRAM module
  - One port of SRAM is connected internally to the processor local bus (PLB) of the PPC, other port brought out to external module port (for memory controller)
  - Identical interface to memory controller as hardware processing element
  - In fact, a simple `define statement in the Verilog source synthesizes either the hardware PE (shown on last slide) or the PPC/SRAM module
- PowerPC manipulates DMA engine configuration/control registers through software library calls, reads/writes SRAM directly (via pointers)
  - Each DMA engine register has a specific memory address (memory-mapped registers), SRAM is assigned an address range
  - Establish a pointer in C to each register address, and create two 8 kB buffers (e.g. unsigned int inbuff[2048], outbuff[2048]) with base addresses pointing to the proper place in the SRAM address range (see the sketch below)
- Compare software processing performance with hardware module
[Figure: PPC Processing Element Architecture]
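A hedged sketch of that setup on the PPC, assuming hypothetical base addresses for the SRAM window and the DMA register block (the real memory map is not reproduced here); cfar_detect() refers to the behavioral sketch shown earlier.

```c
#include <stdint.h>

#define SRAM_BASE     0xC0000000u   /* assumed base of the dual-port SRAM       */
#define DMA_REG_BASE  0xC0010000u   /* assumed base of the DMA engine registers */

/* One pointer per memory-mapped DMA engine register */
volatile uint32_t *const dma_src  = (volatile uint32_t *)(DMA_REG_BASE + 0x0);
volatile uint32_t *const dma_dst  = (volatile uint32_t *)(DMA_REG_BASE + 0x4);
volatile uint32_t *const dma_size = (volatile uint32_t *)(DMA_REG_BASE + 0x8);
volatile uint32_t *const dma_ctrl = (volatile uint32_t *)(DMA_REG_BASE + 0xC);

/* Two 8 kB buffers (2048 x 32-bit words) mapped into the SRAM address range */
volatile uint32_t *const inbuff  = (volatile uint32_t *)(SRAM_BASE + 0x0000);
volatile uint32_t *const outbuff = (volatile uint32_t *)(SRAM_BASE + 0x2000);

void cfar_detect(const uint32_t *in, uint32_t *out, int n);  /* earlier sketch */

/* Process one 4 kB chunk: DMA in, detect, DMA out */
void sw_cfar_chunk(uint32_t sdram_src, uint32_t sdram_dst)
{
    *dma_src = sdram_src;  *dma_dst = SRAM_BASE;  *dma_size = 4096;  *dma_ctrl = 1u;
    /* ... wait for input DMA to complete ... */
    cfar_detect((const uint32_t *)inbuff, (uint32_t *)outbuff, 1024);
    *dma_src = SRAM_BASE + 0x2000;  *dma_dst = sdram_dst;  *dma_size = 4096;  *dma_ctrl = 1u;
    /* ... wait for output DMA to complete ... */
}
```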

12/13/05 10 Application Details: Pulse Compression (PC)
- Pulse compression consists of:
  - 1024-pt. FFT
  - Point-by-point vector multiplication
  - 1024-pt. IFFT
- Heart of the PC processor is the Xilinx FFT core
- Conceptual design is identical to the HW CFAR processor
  - DMA data into SRAM
  - Perform pulse compression on data, write results back to SRAM
  - DMA data back to SDRAM
  - Repeat until entire data set has been processed
- Two variations of the Xilinx FFT core are available, streaming and "store-n-process"
  - Since pulse compression also requires a vector multiply followed by an inverse FFT, selected the store-n-process model
  - Slightly lower FFT performance, but:
    - Selected variation supports double buffering
    - Most practical PC implementation considering the CNF memory design
- Once data is in SRAM, processing proceeds as follows:
  - Load data into FFT core
  - Perform FFT
  - Unload data from core, perform multiplication on each element as it is unloaded, then loop back around to reload the scaled value into the core for the IFFT (see the sketch below)
  - Perform IFFT
  - Unload data from core, place back into SRAM
[Figure: PC Processor Architecture]
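The point-by-point step reduces to the loop below, shown in plain C99 complex arithmetic for clarity; in the testbed this multiply is done in FPGA logic as each element is unloaded from the FFT core, and the reference (matched-filter) spectrum is an assumption of this sketch rather than a detail given in the design.

```c
#include <complex.h>

/* Multiply the FFT output by a precomputed reference spectrum; each scaled
 * value is then reloaded into the core for the 1024-pt. IFFT. */
void pc_vector_multiply(float complex *spectrum,
                        const float complex *reference, int n)
{
    for (int i = 0; i < n; i++)
        spectrum[i] *= reference[i];
}
```

For this implementation n is 1024 per block.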

12/13/05 11 Definition of Experiments
- Two distinct sets of experiments performed with new HCS-CNF
  - Baseline performance benchmarking (throughput, latency)
  - Application case studies (CFAR, pulse compression)
- Baseline performance benchmarking
  - Local read/write performance (no RapidIO, HCS-CNF internal throughput)
  - Remote read/write performance (using NREAD, SWRITE types)
    - To/from SDRAM
    - To/from SRAM
  - Half-Power Point (HPP) analysis of RapidIO
    - HPP is defined as the transfer size at which ½ of peak BW is achieved [1] (see the sketch below)
    - Useful indication of interconnect performance for common transfer sizes
    - Compare HPP of RapidIO to other networks (e.g. Ethernet, InfiniBand, etc.)
- Application case studies
  - Pulse compression
    - Processing time vs. varied locality
    - Comparison to CFAR
  - CFAR
    - Break down processing time into I/O, computation components
    - HW implementation vs. SW implementation
[1] L. Dickman, "Beyond Hero Numbers: Factors Affecting Interconnect Performance," PathScale Inc. White Paper, June 2005
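To make the HPP metric concrete, here is a small illustrative helper (not part of the testbed code) that extracts the half-power point from a set of measured (transfer size, throughput) points:

```c
#include <stddef.h>

/* Return the smallest transfer size whose measured throughput reaches half
 * of the peak throughput; points must be sorted by increasing size.
 * Returns 0 if half of peak is never reached in the measured range. */
size_t half_power_point(const size_t *sizes, const double *bw, int npoints)
{
    double peak = 0.0;
    for (int i = 0; i < npoints; i++)
        if (bw[i] > peak)
            peak = bw[i];
    for (int i = 0; i < npoints; i++)
        if (bw[i] >= 0.5 * peak)
            return sizes[i];
    return 0;
}
```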

12/13/05 12 Experimental Results: HCS-CNF Baseline Performance
- Currently, SDRAM controller only supports single-word accesses
  - Controller optimization to be done soon
  - For now, significantly sub-optimal performance
- Compare new HCS-CNF performance to bare Xilinx RapidIO core
- Local read/write performance measurements
  - Local read: … ns, 4096 bytes (transfer time linear w/ local transfer size); … Mbps throughput; 80 ns latency, or 10 clock cycles per word transfer
  - Local write: … ns, 4096 bytes (transfer time linear w/ local transfer size); … Mbps throughput; 72 ns latency, or 9 clock cycles per word transfer
- Remote NREAD/SWRITE throughput measurements
  - Transfers performed into/out of both local SRAM and SDRAM
    - Recall that remote SRAMs are not visible; all remote accesses address the SDRAM storage of remote nodes
    - Remote (SRAM) = read remote data into local SRAM, or write data from local SRAM to remote location
    - Remote (SDRAM) = read remote data into local SDRAM, or write data from local SDRAM to remote location
  - Fixed transfer sizes, no bus contention used for each measurement: "best-case" results
    - Local – 4 KB; linear time/size relationship, i.e. constant throughput
    - Remote (SRAM) – 8 KB; max transfer size possible
    - Remote (SDRAM) – 1 MB; large enough for near-ideal BW
[Charts: Write throughput; Read throughput]
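As a rough cross-check, inferred only from the per-word latencies above rather than from a quoted measurement: at 10 clock cycles (80 ns) per 32-bit word, single-word reads move about 4 B / 80 ns ≈ 50 MB/s ≈ 400 Mbps, and writes about 4 B / 72 ns ≈ 56 MB/s ≈ 444 Mbps, roughly an order of magnitude below the 4 Gbps design target and consistent with the sub-optimal performance noted above.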

12/13/05 13 Experimental Results: Half-Power Point (HPP) Analysis
- HPP is a useful comparative metric to evaluate different interconnects
  - Lower HPP implies better throughput efficiency for realistic application transfer sizes
  - RapidIO calibration results compared against experimentally measured throughputs of GigE, InfiniBand
- Figure (a) illustrates how to use HPP to compare different interconnects
- Figure (b) shows that RapidIO has the lowest HPP of all networks considered, with half of available throughput being achieved at transfer sizes of 1 KB
  - Versus ~4 KB for InfiniBand, with the worst at nearly 16 KB for Gigabit Ethernet
  - It should be noted: the comparison in Figure (b) is not a completely fair comparison
    - InfiniBand and GigE results measured over MPI, while RapidIO results measured with minimum overhead
    - Will repeat experiments using HCS-CNF with optimized SDRAM
[Figure (a): HPP concept diagram (from [1])]
[Figure (b): HPP of various network technologies; * RDMA GigE = special GigE NICs from Ammasso]

12/13/05 14 Experimental Results: Pulse Compression
- Processing time defined as computation + I/O to SRAM
- Four scenarios performed experimentally:
  - Read data from local SDRAM, process, and write back to local SDRAM
  - Read data from remote node, process, and write back to remote node
  - Read data from local SDRAM, process, and write back to remote node
  - Read data from remote node, process, and write back to local SDRAM
- Performing PC on 1024 elements at a time (i.e. 1024-pt. FFTs)
[Chart: Pulse compression processing time breakdown]

12/13/05 15 Experimental Results: CFAR Processing, Hardware and Software
- Similar experimental setup to PC, also compare HW and SW processing
  - CFAR processing performed on 4 kB chunks of data at a time
    - Input DMA time: 81096 ns (local) or … ns (remote)
    - Processing time: 1047 clock cycles, 8 ns/cycle = 8376 ns
    - Output DMA time: 73020 ns (local) or … ns (remote)
  - I/O time remains constant for HW or SW implementations
    - Memory transfer done independent of processing element
    - Computation time is the only difference
[Chart: CFAR processing time breakdown]
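Putting the local-case numbers above together (a derived figure, not one reported separately on this slide): I/O accounts for 81096 + 73020 = 154116 ns versus 8376 ns of computation, so memory transfers make up roughly 95% of the hardware CFAR processing time per 4 kB chunk.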

12/13/05 16 Analysis
- Baseline HCS-CNF performance
  - SDRAM currently only supports single-word accesses
    - Burst accesses must be enabled to significantly improve performance
    - Optimization of the memory controller state machine could be an entire effort in itself (if I am not careful, assuming such detailed effort is somewhat off-topic)
  - Comparing "best-case" calibration performance results w/ current HCS-CNF:
    - Calibration results represented zero application overhead, therefore unattainable throughput numbers (limited only by the RapidIO implementation itself)
    - HCS-CNF performance should match much more closely with burst transfer support
  - Comparing remote transactions to/from SRAM vs. to/from SDRAM:
    - Higher internal throughput if transferring through SRAM, due to single-cycle access to SRAM
    - Smaller transfers (since maximum transfer size from SRAM would be 8 KB) in turn limit the maximum achievable throughput through remote transfers to/from SRAM
    - Should perform remote transfers to/from SRAM if possible, but the real benefit is capped by transfer size
- HPP comparison
  - Initial experiments suggest RapidIO is well suited for efficient use of throughput in real applications, relative to other popular interconnects
  - Will perform a more fair comparison before drawing solid conclusions
  - HPP metric might be better suited for comparing different implementations of the same interconnect (or at least interconnects claiming to offer similar ideal throughputs)
    - Even though RapidIO indicates lower (i.e. better) HPP than InfiniBand, HPP does not consider InfiniBand's higher throughput capability
    - Only used as a metric to approximate an interconnect's throughput efficiency for arbitrary applications
  - Will repeat HPP analysis with the new endpoint architecture once the SDRAM burst-access issue is resolved

12/13/05 17 Analysis
- Pulse compression
  - Write operations achieve better performance, whether local or remote
    - Serial PC measurements indicate the best approach is to read data locally, then write remotely to pass data for the next stage (given asymmetric read/write performance)
  - Comparable I/O and computation time for local accesses, well suited for double-buffering
    - Selected FFT core variation, combined with the HCS-CNF architecture, can support double-buffering of input data
    - After DMA of data into SRAM, data is further loaded into buffers internal to the FFT core before processing begins
    - While processing, new data can be transferred to SRAM via DMA
    - Memory and data management expected to become more complicated
  - Current serial implementation used for early demonstration and analysis of the new processing node architecture
  - Serial, contention-free operation results in deterministic performance
- CFAR processing
  - Hardware CFAR processing time dominated by I/O
    - Similar to PC, deterministic processing time associated with serial implementation
  - If inefficient use of processing resources is worth the high-speed processing, HW CFAR outperforms SW CFAR in our testbed
    - Our PPC runs at a relatively low clock speed (125 MHz), but:
      - HW CFAR processor (125 MHz) reads, processes, and writes one element per clock cycle
      - Software CFAR implementation would need to run on a very fast processor to keep up
- Corner turns
  - Looking ahead at how a corner turn would be implemented on our testbed, memory access management and orchestration become a major concern
  - Also, is it possible to avoid accessing memory in large strides?
  - Importance and difficulty of the corner turn operation warrant specialized study

12/13/05 18 Conclusions
- Design and verification of the HCS-CNF architecture was a successful, yet significant effort
  - Through collaboration with David Bueno, attempted to mimic our best understanding of Honeywell's processing element architecture
  - One major design revision required earlier in the semester; resulting architecture is very flexible and extendable
- Initial performance testing shows sub-optimal throughput due to the single-access SDRAM controller
  - Enabling burst access to SDRAM storage to be done soon, expected to drastically improve throughput efficiency
  - Single access was considered acceptable for prototyping and initial functional verification of the new HCS-CNF architecture (i.e. getting the node to work is a higher priority at first than optimizing performance)
- Comparison of interconnects via half-power point
  - HPP shows RapidIO in a favorable light compared to other popular interconnect technologies
  - Looking beyond maximum throughput numbers (which are achieved with very large transfers), the efficiency of a particular technology for more common transfer sizes can be quantified with HPP
- Nearly complete GMTI implementation on RapidIO testbed
  - Pulse compression (Doppler processing very similar to PC) and CFAR already built and tested
  - STAP and corner turn orchestration are all that remain for a full GMTI hardware implementation
  - Memory and data management is currently a non-trivial task!
- Applications implemented over RapidIO demonstrate transparency of remote memory to processing elements
  - Deterministic nature of serial implementation results in predictable performance
  - Multi-node, parallel implementations will produce more valuable insight from testbed experimentation
- Wide range of future testbed research directions opened up with latest hardware
- Switch integration TOP priority for testbed work, to be continued through Christmas break
  - Already gathered sufficient reference material from Tundra regarding correct PCB design and layout for the Tsi500
  - Primary action is to assemble a custom PCB for the sampled Tsi500 device we already have in-house
  - Secondary action is to pursue a Virtex-4 board and experiment with the Praesum RapidIO switch core
  - Have reached the end of the road with Honeywell's MIP switch card (??)

12/13/05 19 Future Work
- Focus on integration of switch, testbed expansion
  - High-priority task, increase node count to four CNFs
  - Implement burst access for SDRAM device, improve throughput efficiency
  - Enable full application studies, generic traffic pattern benchmarking
  - Layout and fabricate custom PCB for the sampled Tundra Tsi500 we currently have; possibility of Virtex-4 and Praesum switch core
- Double-buffering performance analysis
  - CNF design supports double-buffering of input data for processing
  - Experimental verification of simulated double-buffering study
- Corner-turn kernel case study
  - Implement generic corner turn operation benchmark
  - Study performance limitations, investigate optimum approaches/scheduling
- Embedded multiprocessor HCS-CNF design using both PPCs in each Virtex-II Pro
  - Current CNF design built with the future in mind for later incorporating both available PPCs
  - Minor adjustments to memory controller state machine, instantiate another DMA engine
  - Parallel processing within each CNF
- Full, parallel GMTI implementation
  - STAP, corner turns are the only remaining stages to implement to complete GMTI
  - Using both the PPC (for STAP and CFAR) and hardware modules (for PC and Doppler processing) to perform various subtasks, implement GMTI on our RapidIO testbed
  - Acquire real input data for verifiable results
  - Dr. John Samson's ideas for CFAR optimization are an option to explore
- Other killer space apps over RapidIO
  - In addition to GMTI, perhaps also investigate other applications of interest
  - Specific applications selected with direction of sponsors
- Larger SDRAM modules for processing nodes
  - For more realistic storage capacity, small side task to fabricate custom SDRAM modules for each CNF
  - Likely 64 MB per node (256 MB total), on a custom PCB that fits onto standard headers on the FPGA board

12/13/05 20 RapidIO Fault-Tolerant Architecture Research

12/13/05 21 R&D on Fault-Tolerant RapidIO Systems
- Most space systems handle network FT by duplicating the entire network
  - Essentially the only option for traditional bus-based systems
  - May not be the most efficient use of the switched fabric provided by RapidIO
- Project goal: research options for network fault tolerance for next-generation RapidIO-based space systems
- Use original SBR backplane designs as starting point
  - Modify design for fault tolerance
- Determine optimal architecture in terms of performance and fault tolerance
  - Also consider size, weight, power constraints by keeping number of powered and unpowered switches small
[Figure: Example FT RIO System from Honeywell]

12/13/05 22 Project Approach
- Thorough survey of technical literature performed Spring '05
  - FT techniques, fault models, metrics and methodologies
  - Studied and categorized important literature
  - Several current designs based on literature on Multi-stage Interconnection Networks (MINs)
- Preliminary evaluation of promising architectures performed Summer '05 (Casey Reardon)
  - Research focused on extra-stage networks and simple redundancy
  - Laid groundwork for methodology used for this phase of work
- Research this semester focuses on additional architectures, using benchmarks designed to heavily stress communication fabric
  - Two architectures under study feature 16-port switches to give insight into design possibilities using higher port-count switches available once serial RapidIO is space qualified
- Two main methods used for comparison of systems
  - Analytical metrics used to compare basic characteristics of networks under study
  - Simulation experiments used to gain further insight into performance and fault tolerance of each architecture

12/13/05 23 Architectural Design Constraints
- Baseline system design supports 32 RapidIO endpoints in a Clos-like network configuration
  - Assume each processing device in system has separate primary and redundant RapidIO interfaces
  - 8-bit, 250 MHz parallel RapidIO
- Network designs now considered to be "system independent"
  - Mapping of processors and switches to boards and backplanes not a primary initial concern
  - Enables study of the architectures from a pure networking standpoint
- Systems must be able to withstand loss of any single switch (at minimum) and maintain connectivity to all devices
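For reference, and as an inference rather than a number stated on this slide: an 8-bit parallel RapidIO link clocked at 250 MHz with double-data-rate signaling carries 8 bits × 250 MHz × 2 = 4 Gbps per direction, which lines up with the 4 Gbps throughput target used for the testbed node earlier in this report.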

12/13/05 24 Benchmark Kernel Summary
- Results from Summer '05 used matrix multiply, LU decomposition, and parallel FFT to measure performance of FT systems
  - Early benchmark kernels did not stress networks enough to create significant performance differences between most architectures in simulation
  - If the network is highly over-provisioned, performance will not vary much, as seen in initial results
  - New benchmarks use more processors, large data sizes, and challenging communication patterns to stress the network
- Two primary benchmark kernels selected for this research
  - Parallel matrix multiply out of global memory
    - 512x512 matrices with 8 bytes per element
    - 16 or 28 processors
    - 4 global-memory ports
    - Many-to-few communication pattern, very similar to the SAR algorithm studied previously
  - Distributed corner turn (see the sketch below)
    - 100 KB of data sent from each processor to every other processor
    - 16, 24, or 32 processors
    - Synchronized and unsynchronized versions
    - All-to-all communication pattern, very similar to straightforward parallel partitioning for GMTI
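A hedged sketch of the synchronized corner-turn kernel, written with MPI purely for concreteness (the simulated RapidIO systems are not MPI-based, and the ring-offset schedule is an assumption of this sketch): every processor exchanges a 100 KB block with every other processor, with a barrier before each transfer step.

```c
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_BYTES (100 * 1024)   /* 100 KB per processor pair */

void corner_turn_sync(MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    char *sendbuf = malloc(BLOCK_BYTES);
    char *recvbuf = malloc(BLOCK_BYTES);

    for (int step = 1; step < nprocs; step++) {
        int dst = (rank + step) % nprocs;           /* simple ring-offset schedule */
        int src = (rank - step + nprocs) % nprocs;
        MPI_Barrier(comm);                          /* synchronize before each transfer */
        MPI_Sendrecv(sendbuf, BLOCK_BYTES, MPI_BYTE, dst, 0,
                     recvbuf, BLOCK_BYTES, MPI_BYTE, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    free(sendbuf);
    free(recvbuf);
}
```

Removing the MPI_Barrier call gives the unsynchronized variant; as the corner-turn results later show, the barrier mainly pays off once faults introduce contention.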

12/13/05 25 Baseline Clos Network
- Non-blocking architecture supporting 32 RapidIO endpoints
- FT accomplished by completely duplicating network (redundant network not shown)
- Withstands 1 switch fault while maintaining full connectivity
[Table: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault); row: Baseline]

12/13/05 26 Redundant First Stage Network
- Similar to baseline, but first level has switch-by-switch failover using components that multiplex 8 RapidIO links down to 4
  - Must consider muxes as potential point of failure
- Second-level FT handled by redundant-paths routing
  - Full connectivity maintained as long as 1 of 4 switches remains functional
  - Could also supplement with redundant second level using switch-by-switch failover, at cost of more complex multiplexing circuitry
- Muxes may present single point of failure, so processor-level redundancy may be needed
[Table: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault); row: Redundant First Stage (8:4) 18]

12/13/05 27 Redundant First Stage Network: Extra-Switch Core
- Adds additional core switch to redundant first stage network
  - Switch may be left inactive and used in event of fault
- Second-level FT handled by redundant-paths routing
  - Requires switches with at least 9 ports in first level, 8 ports in second level
  - Multiplexers must be 10:5 rather than 8:4
[Table: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault); row: Redundant First Stage: Extra-Switch Core (10:5) 19]

12/13/05 28 Redundant First Stage Network: No Muxes
- Muxes add additional complexity and may be a point of failure
  - May be challenging to build LVDS mux components
- Design requires 16-port switches in backplane, but only needs 8 active ports per switch
  - High port-count switches will be enabled through space-qualified serial RapidIO
  - For future serial RIO, assume Honeywell HX5000 SerDes with … GHz x 4 lanes (possible per Honeywell High-Speed Data Networking Tech. data sheet, June '05)
    - Roughly equivalent to 16-bit, … MHz DDR parallel RIO
  - For this research, using parallel RIO clock rates for fair comparison
[Table: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault); row: Redundant First Stage: No Muxes]

12/13/05 29 Redundant First Stage Network: No Muxes + Extra-Switch Core
- Combines methodologies from previous two architectures shown
- Requires 9-port switches in first level, 16-port switches in second level
  - Realistically attainable using serial RIO
- Availability of a 32-port serial switch would greatly simplify design (1-switch network!)
  - Preferred FT approach would tend towards "redundant network" approach for fabrics of these sizes
[Table: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault); row: Redundant First Stage: No Muxes + Extra-Switch Core]

12/13/05 30 Extra-Stage Clos Network
- Several extra-stage networks explored in Summer '05 research
  - Concept based on Extra Stage Cube MIN from Purdue
  - Most promising of those designs selected for future study (shown)
- Baseline configuration (no faults) bypasses first network stage
  - 8:4 muxes needed to select between stages
  - Under switch fault conditions, first stage is used to bypass faulty switch(es)
- Routing in these systems more complex than other systems under study
  - Load balancing a major challenge
[Table: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (2nd-level fault) | Switches to Reroute (3rd-level fault); row: Extra-Stage Clos Network (8:4) 118]

12/13/05 31 Fault-Tolerant Clos Network
- Architecture studied at NJIT in 1990s, adapted here for RapidIO
- Uses multiplexers (4:1) for more efficient redundancy in first level
  - Only requires 1 redundant switch for every 4 switches in first stage
  - Multiplexer components are no longer a potential single point of failure for connectivity of any processors
- Has additional switch in second level, similar to other architectures shown
- Requires 9-port switches in first level, 10-port switches in second level
  - 24-endpoint version possible using only 8-port switches and 3:1 muxes
- Can withstand 1 first-level fault on either half of network with no loss in functionality or performance
  - Compromise relative to fully-redundant first-stage approaches in terms of FT and size/weight/cost
[Table: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault); row: Fault-Tolerant Clos Network (4:1) 58]

12/13/05 32 Results: Matrix Multiply
- All faulty switches are placed in second stage
  - First-stage fault results are trivial in most cases due to redundancy
- Trends similar for both system sizes
  - In general, performance dominated by contention for 4-port global memory
  - Not much variation for any systems
- Best-performing systems were ones with extra-switch core stage
  - Includes Fault-Tolerant Clos
  - Able to withstand 1 fault with no drop in performance
- Cases where performance appears to improve with 1 fault are "in the noise"

12/13/05 33 Results: Corner Turn
- Much more variation in execution time due to vastly increased stress on entire network
- Trends seen in matrix multiply are now amplified
  - Extra-switch core designs excel even more in 1-fault and 2-fault cases
  - Other designs suffer due to lack of core-stage redundancy
  - Extra-stage network suffers due to routing complications and difficulty in load balancing
    - Additional simulation results (not shown) verify this trend
    - Same FT is provided by redundant first stage with much more simplicity
- 32-processor benchmark further amplifies differences between networks under study

12/13/05 34 Additional Results: Corner Turn
- Synchronized corner turn uses barrier sync. before each data transfer of each corner turn (upper chart)
  - Found to be very helpful in earlier GMTI simulations
- In cases of limited contention (0 faults), sync. version actually slightly slower
  - Deterministic behavior when full bandwidth is available limits need for sync., because algorithm stays synchronized without intervention
    - In GMTI, incoming data would disturb the sync., making barriers necessary
- In cases of contention (1 or 2 faults), sync. greatly improves performance
  - This trend observed for all other systems as well
    - Whenever contention is introduced, sync. helps corner turn performance
- Lower chart demonstrates possible problem when over-provisioning network
  - Corner turn scheduling optimized for baseline network configuration
  - Simply adding additional switch to network core (i.e. second level) actually hurts performance by introducing conflicts into scheduling
  - 5th core switch should be left inactive in order to save power and simplify scheduling
    - Applies to all extra-switch-core type networks studied (Fault-Tolerant Clos used as an example)

12/13/05 35 Analytical Results: Summary
[Table: comparison of all networks studied. Columns: Active Switches | Standby Switches | Total Switches | Active Ports Per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault). Rows: Baseline; Redundant First Stage (8:4) 18; Redundant First Stage: Extra-Switch Core (10:5) 19; Redundant First Stage: No Muxes; Redundant First Stage: No Muxes + Extra-Switch Core; Extra-Stage Clos Network (8:4) 118; Fault-Tolerant Clos Network (4:1) 58]

12/13/05 36 Conclusions
- Variety of tradeoffs studied in FT and performance for Clos-like RapidIO networks
- Multiplexing components used in several networks studied
  - Especially helpful for first-level switch fault tolerance
  - Actual hardware implementation of the mux component may be challenging, and the component may present a point of failure in some systems
- Serial RapidIO provides a powerful alternative to multiplexer components through high switch port count, and may soon be space qualified
  - Many ports in our designs may be left inactive, potentially saving power while maintaining flexibility
- Extra-switch core designs are a key technique for providing fault tolerance and performance
  - Extra switch may be left inactive to conserve power and simplify routing with no loss in performance
- Fault-Tolerant Clos network was very strong both analytically and in simulation
  - Allows 1 first-level switch to serve as a backup for 4 others
  - Uses slightly more complex multiplexer components, but structured in such a way that the components do not represent a single point of failure
    - In other systems with multiplexers, board-level redundancy may be necessary
  - Also has an extra-switch core for second-level FT
- Further experiments exposed flaws in the extra-stage designs previously studied
  - Redundant first stage provides all benefits of extra-stage designs, with reduced routing complexity
- General rule: fewer switch ports create a need for more switches and more clever architectures such as Fault-Tolerant Clos
- General rule: more switch ports enable fewer switches and push architectures back towards conventional redundancy-based approaches (i.e. a single 32-port primary/redundant switch)

12/13/05 37 Potential Future Work (from earlier)
- Testbed Research
  - Switch integration (board design)
  - Finish complete testbed GMTI implementation (add STAP and Doppler processing)
  - RapidIO endpoint design tradeoffs and enhancements
  - Testbed FDM implementation
  - Additional algorithms of interest implemented on testbed in FPGAs/PPCs
  - Testbed SAR implementation (later in the year)
  - Extend testbed with serial RapidIO
- Modeling and Simulation Research
  - Concurrent approach with testbed GMTI implementation
    - Create detailed models of GMTI using architecture similar to the one studied at Honeywell in Summer '05
    - Include detailed simulation of memory accesses and model contention between memory and network
    - Validate models using testbed and extend work beyond the capabilities of the testbed
    - Add latency-sensitive "control layer" to GMTI case study
  - Research into optimal method for meeting hard real-time deadlines in a RapidIO-based system
    - Compare to circuit-switched networks
    - Latency-sensitive "control layer" for GMTI could be incorporated into this research
  - Perform similar tasks for SAR or other complete case studies of interest such as FDM