© SKY Computers, Inc. All Rights Reserved 2/8/00 The Decomposition of HPEC Applications Mapped to the Natural Decomposition of a Solution Architecture.

Slides:



Advertisements
Similar presentations
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
Advertisements

The University of Adelaide, School of Computer Science
2. Computer Clusters for Scalable Parallel Computing
Lecture 12 Page 1 CS 111 Online Devices and Device Drivers CS 111 On-Line MS Program Operating Systems Peter Reiher.
Chapter 8 Hardware Conventional Computer Hardware Architecture.
Router Architecture : Building high-performance routers Ian Pratt
1 K. Salah Module 4.0: Network Components Repeater Hub NIC Bridges Switches Routers VLANs.
ECE 526 – Network Processing Systems Design
1 AppliedMicro X-Gene ® ARM Processors Optimized Scale-Out Solutions for Supercomputing.
Router Architectures An overview of router architectures.
Router Architectures An overview of router architectures.
Module I Overview of Computer Architecture and Organization.
Study of AES Encryption/Decription Optimizations Nathan Windels.
PHY 201 (Blum) Buses Warning: some of the terminology is used inconsistently within the field.
Computer performance.
RSC Williams MAPLD 2005/BOF-S1 A Linux-based Software Environment for the Reconfigurable Scalable Computing Project John A. Williams 1
EVOLVING TRENDS IN HIGH PERFORMANCE INFRASTRUCTURE Andrew F. Bach Chief Architect FSI – Juniper Networks.
Computer Architecture ECE 4801 Berk Sunar Erkay Savas.
Computing Hardware Starter.
N-Tier Client/Server Architectures Chapter 4 Server - RAID Copyright 2002, Dr. Ken Hoganson All rights reserved. OS Kernel Concept RAID – Redundant Array.
Computer Maintenance Unit Subtitle: Bus Structures Excerpted from Copyright © Texas Education Agency, All rights reserved.
A brief overview about Distributed Systems Group A4 Chris Sun Bryan Maden Min Fang.
Silicon Building Blocks for Blade Server Designs accelerate your Innovation.
NETW 3005 I/O Systems. Reading For this lecture, you should have read Chapter 13 (Sections 1-4, 7). NETW3005 (Operating Systems) Lecture 10 - I/O Systems2.
Architecture Examples And Hierarchy Samuel Njoroge.
Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,
Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"
J. Christiansen, CERN - EP/MIC
The variety Of Processors And Computational Engines CS – 355 Chapter- 4 `
© 2004 Mercury Computer Systems, Inc. FPGAs & Software Components Graham Bardouleau & Jim Kulp Mercury Computer Systems, Inc. High Performance Embedded.
Jump to first page One-gigabit Router Oskar E. Bruening and Cemal Akcaba Advisor: Prof. Agarwal.
MBG 1 CIS501, Fall 99 Lecture 18: Input/Output (I/O): Buses and Peripherals Michael B. Greenwald Computer Architecture CIS 501 Fall 1999.
ECE 526 – Network Processing Systems Design Computer Architecture: traditional network processing systems implementation Chapter 4: D. E. Comer.
CS 4396 Computer Networks Lab Router Architectures.
Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.
Operating System Issues in Multi-Processor Systems John Sung Hardware Engineer Compaq Computer Corporation.
Lecture 25 PC System Architecture PCIe Interconnect
DEVICES AND COMMUNICATION BUSES FOR DEVICES NETWORK– PARALLEL BUS DEVICE PROTOCOLS 1.
1Thu D. NguyenCS 545: Distributed Systems CS 545: Distributed Systems Spring 2002 Communication Medium Thu D. Nguyen
COMP381 by M. Hamdi 1 Clusters: Networks of WS/PC.
The Central Processing Unit (CPU)
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
LIGO-G9900XX-00-M LIGO II1 Why are we here and what are we trying to accomplish? The existing system of cross connects based on terminal blocks and discrete.
WorldScape Defense Company, L.L.C. Company Proprietary Slide 1 An Ultra-High Performance Scalable Processing Architecture for HPC and Embedded Applications.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
Multiprocessor  Use large number of processor design for workstation or PC market  Has an efficient medium for communication among the processor memory.
Mohamed Younis CMCS 411, Computer Architecture 1 CMCS Computer Architecture Lecture 26 Bus Interconnect May 7,
Cluster Computers. Introduction Cluster computing –Standard PCs or workstations connected by a fast network –Good price/performance ratio –Exploit existing.
1 Device Controller I/O units typically consist of A mechanical component: the device itself An electronic component: the device controller or adapter.
Background Computer System Architectures Computer System Software.
Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
System on a Programmable Chip (System on a Reprogrammable Chip)
CHAPTER 11: Modern Computer Systems
LHCb and InfiniBand on FPGA
Operating Systems (CS 340 D)
System On Chip.
Presented by: Tim Olson, Architect
Chapter III Desktop Imaging Systems & Issues
Scaling for the Future Katherine Yelick U.C. Berkeley, EECS
Characteristics of Reconfigurable Hardware
Overview of Computer Architecture and Organization
Constructing a system with multiple computers or processors
Overview of Computer Architecture and Organization
Computer Evolution and Performance
Computer Architecture
Chapter 13: I/O Systems.
Cluster Computers.
Presentation transcript:

© SKY Computers, Inc. All Rights Reserved 2/8/00 The Decomposition of HPEC Applications Mapped to the Natural Decomposition of a Solution Architecture

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 2 Historical Solution Drivers for HPEC Applications pCirca  Specialized Co-Processors Tightly Coupled to the Host CPU  Dual-Ported Memories; albeit very Small Amounts  Data Acquisition via Host CPU Bus pCirca  First Wave of Small-Count Multi-Processor Embedded Applications  Localized Non-Shared Memory  Low Bandwidth, Low Functionality Bus-Based Interconnects  Data Acquisition via Direct Parallel I/O Ports on Each Card pCirca  First Wave of Larger-Count Multi-Processor Applications  High Bandwidth, Crossbar Based Interconnects  Lots of Processors and Lots of Small Distributed Memory Pieces  Data Acquisition via Direct Fabric Based Ports on Each Card  The era of the Compact Application Benchmark

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 3 Historical Trends for Application Fitment to Hardware pIt was …..  Finesse the I/O and Computations to Fit the Computer’s Architecture which included Integrated I/O and Computations  Buffer, Rearrange, Move, and Process the Data In-Fabric using Widely-Distributed Small Buffers of Memory  Tightly Couple the Application’s Architecture to the Distributed Memory and Computer’s Architecture pThe Community had an Approach to Understand Behavior and Performance Estimates for Complex HPEC Systems  Standard Benchmarks: 2DFFT, Corner Turns, STAP, etc  Help it with Middleware: MPI/RT, Data Re-Org, VSIPL  But they did not really address the I/O and the impacts thereto to the Overall Processing and Processing Management of Data

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 4 The FOPEN Example From the Pre-Processor CT1 Unpack Data Motion Comp Azimuth CT2CT3 Stolt Interp Resample Range Complex Image Azimuth Matched Filter Range 17.5K Azimuth 96K Azimuth 21K Range 5K Range 17.5K Range 17.5K Azimuth 21K Azimuth 21K Azimuth 96K Range 17.5K Do this on-the-fly into the processor From 2 Byte to 8Byts complex quantities CVMUL 96K CFFT Radix-3 CVMUL SQRT 2 SQRs Non-integer resampling in range Non-integer downsampling to reduce data in Azimuth 1)CFFT,Select 21K,CIFT (Radix-5) or 2) FIR Filters Note 1: To Preserve Memory pack and unpack the data before and after corner turns Note 2: To Preserve and Bandwidth Convert to/from float and Integer Range 5K Azimuth 21K CFFT,Zero Fill, CIFT, Pick 1/4 Pts. Azimuth 20K Range 17.5K

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 5 The Architecture Problem pSmall Chunks of Distributed Memory can be Problematic  Lots of Small Memory Buckets = Lots of Data Movement  Lots of Data Movement = Lots of Bandwidth Needed  Hence the Problem: The Memory becomes in-fabric pSmall Memory Buckets are Challenging to Manage  Data Feeds are Scattered among the Fabric  Partial Data Sets are Unnaturally Broken Up  Many times, way too Much Scatter, Gather, and Re Organization pHPEC Problems Naturally Decompose into Two Key Areas 1.Data Acquisition, Buffering, and Re-distribution 2.High Speed and Highly Complex Computations on Well Bounded Data Sets utilizing Well Bounded Algorithms They have very different Architectural Needs if they are to be Optimally Served

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 6 HPEC Application Needs Mapping Compute Requirements High Performance CPU’s Specialized Compute Nodes Localized Compute Clusters High Speed Interconnects Optimized for Computational Performance per $/Watt/Area High Compute & High I/O Requirements Specialized Compute Nodes Specialized I/O Nodes High Speed Interconnects Heterogeneous Hardware I/O Bandwidth Compute Performance General Purpose Data Processing Requirements General Purpose PC or Server NOT Designed for HPEC Data I/O Requirements I/O Device Nodes: Capture, Buffer, Reorg, and Redistribute High Speed Interconnects Optimized for Data Acquisition and Data Management Services Low High Low High

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 7 What’s Different in Today’s Computer Architectures pProcessor-to-Memory Bandwidth is Huge  64-bit Wide DDR; > 3 GB/Sec memory access speed is possible  Memory is Cheap and Abundant pI/O and Local System Bus Bandwidth is Very High  Commodity Busses e.g. PCI-X > 1GB/Sec  Lots of Peripherals, Lots of Available Software (driver) Support pInterconnect Fabrics are FAST and SMART  1GB/Sec per Port  Self-Discovery, Fault Detection, Recovery, Reliable pStandard Processors are Readily Available; Specialized Devices are Becoming Easier to Use  High Throughput SIMD’s, DSP’s, Fast Server-Centric Processors  FPGA’s, ASIC’s, and Other Custom Logic pToday, it is Easier to Balance I/O and Computational Needs at the Computer Level rather than at the Application Level

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 8 InfiniBand Switch Blade Data Acquisition Server SMARTpac 600 Compute Blade PCI I/O Cards HAA Blade p Six PCI-X adapter slots (PCI 32-bit/33Mhz up to PCI-X 64 bit/100Mhz) p Six compute blade slots for single or dual-processor compute blades  Six InfiniBand connections at rear of chassis  Two RJ45 Ethernet connections for High Application Availability Infrastructure Optimized for Data Acquisition and Data Management Services

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 9 Compute Server SMARTpac 1200 Compute Blade p Twelve compute blade slots for single, dual-processor, or special function compute blades  Six InfiniBand connections at rear of chassis  Two RJ45 Ethernet connections for High Application Availability Infrastructure InfiniBand Switch Blade HAA Blade Optimized for Computational Performance per $/Watt/Area

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 10 Piped Connection Medium Bandwidth – Lower Complexity Data Acquisition ServerCompute Servers IB Switch DAS Node PMC DAS Node PMC DAS Node PMC DAS Node PMC IB Switch Sensor Management Node Disk Archival Ethernet DAS Node PMC Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 11 Fully Connection Mesh High Bandwidth – Higher Complexity Data Acquisition ServerCompute Servers DAS Node PMC DAS Node PMC DAS Node PMC DAS Node PMC IB Switch Sensors Management Node Disk Archival Ethernet DAS Node PMC Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server Compute Server IB Switch

© SKY Computers, Inc. All Rights Reserved 2/8/00 Slide 12 Decomposition Summary pMost HPEC Problems Naturally Decompose into: 1.Data Acquisition and Management Services, and 2.Computational Services pThe Current HPEC Systems built around Raceway™, SKYchannel™, and Myranet™ are representative of an “older school” approach, whereby I/O and Computes are tightly coupled, physically bound together, and use Small Buckets of Fabric-based Shared memory pToday’s Technologies allow one to Think and Actually Implement Differently to Meet an Application’s Actual Decomposed I/O and Processing Needs  Data Acquisition Servers optimized for I/O Services, Data Buffering, Data Management, and Data Distribution  Compute Servers optimized for Signal and Image Processing