Advanced Processing Group ISO9001 Registered Octopus: A Multi-core implementation Export of this products is subject to U.S. export controls. Licenses.

Slides:

Advertisements

Similar presentations

System Area Network Abhiram Shandilya 12/06/01. Overview Introduction to System Area Networks SAN Design and Examples SAN Applications.

Advertisements

NIDays 2007 Worldwide Virtual Instrumentation Conference

Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.

AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.

StreamBlade SOE TM Initial StreamBlade TM Stream Offload Engine (SOE) Single Board Computer SOE-4-PCI Rev 1.2.

GPU System Architecture Alan Gray EPCC The University of Edinburgh.

Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.

Performance Characterization of the Tile Architecture Précis Presentation Dr. Matthew Clark, Dr. Eric Grobelny, Andrew White Honeywell Defense & Space,

CS 213 Commercial Multiprocessors. Origin2000 System – Shared Memory Directory state in same or separate DRAMs, accessed in parallel Upto 512 nodes (1024.

1 BGL Photo (system) BlueGene/L IBM Journal of Research and Development, Vol. 49, No. 2-3.

Multiprocessors ELEC 6200 Computer Architecture and Design Instructor: Dr. Agrawal Yu-Chun Chen 10/27/06.

Chapter 17 Parallel Processing.

Hitachi SR8000 Supercomputer LAPPEENRANTA UNIVERSITY OF TECHNOLOGY Department of Information Technology Introduction to Parallel Computing Group.

Figure 1.1 Interaction between applications and the operating system.

Embedded Transport Acceleration Intel Xeon Processor as a Packet Processing Engine Abhishek Mitra Professor: Dr. Bhuyan.

Router Architectures An overview of router architectures.

System Architecture A Reconfigurable and Programmable Gigabit Network Interface Card Jeff Shafer, Hyong-Youb Kim, Paul Willmann, Dr. Scott Rixner Rice.

A Flexible Architecture for Simulation and Testing (FAST) Multiprocessor Systems John D. Davis, Lance Hammond, Kunle Olukotun Computer Systems Lab Stanford.

Leveling the Field for Multicore Open Systems Architectures Markus Levy President, EEMBC President, Multicore Association.

Hardware Overview Net+ARM – Well Suited for Embedded Ethernet

PHY 201 (Blum) Buses Warning: some of the terminology is used inconsistently within the field.

Asis AdvancedTCA Class What is a Backplane? A backplane is an electronic circuit board Sometimes called PCB (Printed Circuit Board) containing circuitry.

Introduction to Symmetric Multiprocessors Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı

Cluster computing facility for CMS simulation work at NPD-BARC Raman Sehgal.

Gigabit Routing on a Software-exposed Tiled-Microprocessor

Advantages of Reconfigurable System Architectures

Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.

1 Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)

RSC Williams MAPLD 2005/BOF-S1 A Linux-based Software Environment for the Reconfigurable Scalable Computing Project John A. Williams 1

Computer System Architectures Computer System Software

HyperTransport™ Technology I/O Link Presentation by Mike Jonas.

UNIX System Administration OS Kernal Copyright 2002, Dr. Ken Hoganson All rights reserved. OS Kernel Concept Kernel or MicroKernel Concept: An OS architecture-design.

CHAPTER 11: Modern Computer Systems

Silicon Building Blocks for Blade Server Designs accelerate your Innovation.

Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.

A TCP/IP transport layer for the DAQ of the CMS Experiment Miklos Kozlovszky for the CMS TriDAS collaboration CERN European Organization for Nuclear Research.

Boosting Event Building Performance Using Infiniband FDR for CMS Upgrade Andrew Forrest – CERN (PH/CMD) Technology and Instrumentation in Particle Physics.

Sigrity, Inc © Efficient Signal and Power Integrity Analysis Using Parallel Techniques Tao Su, Xiaofeng Wang, Zhengang Bai, Venkata Vennam Sigrity,

TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18.

SLAC Particle Physics & Astrophysics The Cluster Interconnect Module (CIM) – Networking RCEs RCE Training Workshop Matt Weaver,

Confidential 1 SpecificationsFeatures ProcessorFreescale MPC8640 Single 1 GHz DDRAMDual channel DDR2 with ECC, 512 MB (expandable up to 2GB) Flash.

Hardware Trends. Contents Memory Hard Disks Processors Network Accessories Future.

High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.

VPX Cover Overview and Update “VPX, Open VPX, and VPX REDI ” are trademarks of VITA.

Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.

Buffer-On-Board Memory System 1 Name: Aurangozeb ISCA 2012.

Srihari Makineni & Ravi Iyer Communications Technology Lab

David Abbott - JLAB DAQ group Embedded-Linux Readout Controllers (Hardware Evaluation)

Slide ‹Nr.› l © 2015 CommAgility & N.A.T. GmbH l All trademarks and logos are property of their respective holders CommAgility and N.A.T. CERN/HPC workshop.

CS 4396 Computer Networks Lab Router Architectures.

Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.

Sep. 17, 2002BESIII Review Meeting BESIII DAQ System BESIII Review Meeting IHEP · Beijing · China Sep , 2002.

Rad Hard By Software for Space Multicore Processing David Bueno, Eric Grobelny, Dave Campagna, Dave Kessler, and Matt Clark Honeywell Space Electronic.

© 2007 Altera Corporation FPGA Coprocessing in Multi-Core Architectures for DSP J Ryan Kenny Bryce Mackin Altera Corporation 101 Innovation Drive San Jose,

WorldScape Defense Company, L.L.C. Company Proprietary Slide 1 An Ultra-High Performance Scalable Processing Architecture for HPC and Embedded Applications.

TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect Heiner Litz University of Heidelberg.

Cluster Computers. Introduction Cluster computing –Standard PCs or workstations connected by a fast network –Good price/performance ratio –Exploit existing.

Johannes Lang: IPMI Controller Johannes Lang, Ming Liu, Zhen’An Liu, Qiang Wang, Hao Xu, Wolfgang Kuehn JLU Giessen & IHEP.

PERFORMANCE OF THE OPENMP AND MPI IMPLEMENTATIONS ON ULTRASPARC SYSTEM Abstract Programmers and developers interested in utilizing parallel programming.

USB The topics covered, in order, are USB background

HyperTransport™ Technology I/O Link

How to Quick Start Virtual Platform Development

System On Chip.

Appro Xtreme-X Supercomputers

Group Manager – PXI™/VXI Software

Low Latency Analytics HPC Clusters

Dr. Jeffrey M. Harris Director of Research and System Architecture

Hybrid Programming with OpenMP and MPI

Cluster Computers.

Presentation transcript:

Advanced Processing Group ISO9001 Registered Octopus: A Multi-core implementation Export of this products is subject to U.S. export controls. Licenses may be required. This material provides up-to-date general information on product performance and use. It is not contractual in nature, nor does it provide warranty of any kind. Information is subject to change at any time. Kalpesh Sheth HPEC 2007, MIT, Lincoln Lab

Advanced Processing Group ISO9001 Registered 2 Parameters to evaluate? Many vendors have multi-core, multi-chip boards Characteristics of good evaluation  How much memory BW among multiple cores? (i.e. core to core)  How much I/O BW between multiple sockets/chips on same board? (i.e. chip to chip on same board)  How much fabric BW across boards? (i.e. board to board) CPU performance with I/O combined  Data has to come from somewhere and go into memory

Advanced Processing Group ISO9001 Registered 3 Multi-core Challenge Current Benchmarks  Most don’t involve any I/O  Many are cache centric  Many are single core centric (no multi-threading) Questions to ask?  Interrupt handling among multiple cores  Inter-process communication  How many channels for DDR2 interface? 4 or better?  Size, Weight and Power (SWaP) when fully loaded?  How to debug multi-threaded programs? Cost of tools?  What is the cost of ccNUMA or NUMA? Memory R/W latency?

Advanced Processing Group ISO9001 Registered 4 Traditional Intel Platform Single entry point for memory access All external I/O via Southbridge  GigE, UART competes with Fabric  Always requires CPU cycles

Advanced Processing Group ISO9001 Registered 5 Traditional PowerPC Platform Local memory access Mostly no I/O RapidIO (parallel or serial) switching Limited BW between chips Fabric bottleneck as all data comes via fabric only

Advanced Processing Group ISO9001 Registered 6 DRS Approach Each System on Chip (SoC) has HyperTransport switch Local memory access with NUMA and ccNUMA capability Data can come locally or via fabric

Advanced Processing Group ISO9001 Registered 7 HT/ SPI-4 HT/ SPI-4 HT/ SPI-4 ZBbus SoC I/O: -4 GigE -64b PCI-X -System I/O Memory Bridge Port0 SB-1 Core1 SB-1 Core2 SB-1 Core3 SB-1 Core0 Shared L2 Mem Ctrl Packet DMA X Port1Port2 ZBbus, Split Transaction, Coherent, 128 Gbps Multi-Channel Packet DMA Engine Remote Memory Requests & ccNUMA Protocol On-Chip Switch, 256 Gbps bandwidth, 5-ports 19.2 Gbps in each direction 4-Issue SuperScalar MIPS64 SB-1 (8 GFLOPS/Core) 102 Gbps Memory Controller BCM1480 Architecture Overview

Advanced Processing Group ISO9001 Registered 8 Performance measurement How do you measure performance of multi-core embedded system?  Perform network I/O while doing number crunching – Examples uBench, netPerf with FFTw  Measure memory BW with streams benchmark  Measure intra-core and inter SoC BW performance – Examples openMPI (measures latency and BW)  How open standards are supported? – Examples CORBA, VSIPL, FFTw, RDMA, MPI etc.  Measure switch fabric BW and latency  XMC and PMC plug ability and their interface speed  Boot time (in seconds)

Advanced Processing Group ISO9001 Registered 9 VSIPL Benchmarks

Advanced Processing Group ISO9001 Registered 10 Octopus as the Open Source/Standard and COTS commodity Solution Leverages open source development & run time environments  Including Linux OS (SMP), Eclipse IDE, GNU tools  Promotes ease of portability and optimal resource utilization Delivers open standard middleware & signal processing libraries  Including VSIPL, MPI, FFTw and CORBA Effectively decouples the application software from the hardware  Applications interface to the OS at a high level (layer 4) Utilizes standards based hardware  VITA 41 (VXS) backplane – backwards compatible with VME  VITA 42 (XMC) mezzanine standards - permits rapid insertion of new technology Implemented using commodity chips taken from large adjacent markets  Broadcom dual/quad core processors from Telecom/Datacom Network Processing  HyperTransport from commodity computing  Dune fabric from the Tera-bit router market Portability & Inter-operability  Reduced Life Cycle Cost 21 Slot

Advanced Processing Group ISO9001 Registered 11 Evaluation Independent evaluation done (Summer 2007) Summary of Results (based on 8-slot chassis)  166 GFLOPS sustained throughput  4.6 FLOPS/byte (2G DDR2 per BCM1480 SoC)  136 MFLOPS/W computation efficiency  4.9 GFLOPS/Liter computation density (not counting I/O) Contact DRS for more details

Advanced Processing Group ISO9001 Registered 12 Thank You For more information: Advanced Processing Group 21, Continental Blvd, Merrimack, NH Phone: x326

Advanced Processing Group ISO9001 Registered Backup Slides

Advanced Processing Group ISO9001 Registered 14 1 Gig. E. 41 Switch slotVME slot VITA 41 (OCTOPUS) System Overview High Performance Embedded Multi-computing system backwards compatible with VME High Speed scalable advanced switch fabric with telecoms grade reliability Separate and redundant out-of- band (VITA 41.6) GigE control plane to every processing element Linux OS (SMP or ccNUMA) Octopus boards  Shown in standard VITA 41 chassis’  Sourced from commodity chassis vendor  Alternative backplane configurations exist and are supported as defined in VITA Switch slot 41 slot VME 10 Gig. E. Fabric V41 Switch VME Payload 41 or VME V41 Switch 21 Slot

Advanced Processing Group ISO9001 Registered 15 OCTOPUS Switch Card “Smart Switch” acts as Host controller  Controls boot sequence  Provides services to system Data Plane Connectivity  Measured net 1.1 GBytes/s full duplex between any payload boards after fabric overhead and encoding/decoding  Provides dynamic connectivity (no more static route tables) and single layer of switching between payloads Control Plane GigE Connectivity  1 GigE to every payload board and front panel  10 GigE between switch cards Dual Core Processor on board  Each core running Linux at 600 MHz  256 MB DDR SDRAM at 125 MHz DDR memory bandwidth  User (64 MB) and boot (4 MB) flash Temperature Sensors  1250 processor and each fabric switch

Advanced Processing Group ISO9001 Registered 16 OCTOPUS Motherboard High performance Quad Core processor  MIPS64 SoC (System on a Chip) processor  8 GFLOPS/processor (32 GFLOPS/Chip) with 4.8 GByte/s composite I/O  Running SMP Linux at 1GHz on each core (ccNUMA capable) 2 GB of DDR2 SDRAM Flash  User (128MB) and boot (4MB) Front panel GigE and USB Power ~ 45W (Max) Temperature Sensors  1480 processor and fabric end-point

Advanced Processing Group ISO9001 Registered 17 OCTOPUS Dual XMC VITA 42.4 compliant 2 x High performance Quad Core processors  MIPS64 SoC processors  Running SMP Linux at 1GHz on each core (ccNUMA capable) With each Quad Core processor:  2 GB of DDR2 SDRAM (3.2 GB/s)  4 MB of flash Total of 8 x 1 GHz cores, 4 GB DDR2 SDRAM and 8 MB boot flash! Power ~ 65 W Temperature Sensors  For each 1480 processor

Advanced Processing Group ISO9001 Registered 18 OCTOPUS Payload Slot Fully loaded payload slot (Motherboard plus Dual XMC) provides: 96 GFLOPS = 3 x High performance MIPS64 Quad BCM GB of DDR2 SDRAM Flash  User 128 MB  Boot 12 MB Node aliasing and ccNUMA On board PCI-X and HyperTransport Off board VITA 41 serial fabric Power ~ 110W (Max)

Advanced Processing Group ISO9001 Registered 19 Inter-SoC Connectivity

Advanced Processing Group ISO9001 Registered Figures of Merit Comparison 70 W/Liter convection cooling limit Chassis with Radstone/PowerPC (G4DSP) card (not counting I/O bottlenecks) Chassis (counting I/O bottlenecks) DRS Octopus Chassis Note: external I/O as needed via PMC sites in both systems

Advanced Processing Group ISO9001 Registered 21 Timing Results Normalized by Volume Cell Processor software effort in process Multi-core Board Volume per Board Gcc/Linux Multi- threading Gcc/Linux Multi- processes Vendor Compiler Multi- threading Vendor Multi- threading Vendor Compiler Multi- processes Vendor Multi- processes Cu. InchesPer cu. In. Intel Sossaman ,817, ,794, ,636, ,727,754 Intel Dempsey ,181, ,049, ,600, ,432,983 Intel Woodcrest ,952, ,683, ,099, ,443,337 Intel Montecito , ,413, , ,268,689 DRS-IT MIPS ,196, ,249,655 Fabric7 Opteron ,237, ,508,441 Octopus SAR Benchmark Comparison Data Presented at Processor Technology Symposium on 10/3/06 Best SWaP Performance

Advanced Processing Group ISO9001 Registered 22 Signal Recognizer Octopus Shows SWaP Advantage over Blade Servers

Advanced Processing Group ISO9001 Registered 23 Development Environment

Advanced Processing Group ISO9001 Registered 24 Advanced Debug Tools