High Performance Flexible DSP Infrastructure Based on MPI and VSIPL
7th Annual Workshop on High Performance Embedded Computing, MIT Lincoln Laboratory, 23-25 September 2003


High Performance Flexible DSP Infrastructure Based on MPI and VSIPL
7th Annual Workshop on High Performance Embedded Computing, MIT Lincoln Laboratory, Sept 2003
Tom McClean, Lead Member Engineering Staff

Hard Real Time DSP Challenge

Develop a Portable and Easily Scalable DSP
- Portability requires the use of Open Architecture Standards:
  - Low Overhead Communication: Message Passing Interface (MPI)
  - Vector Processing: Vector Signal Image Processing Library (VSIPL)
  - Programming Language: C++
- Scalability requires an infrastructure which is highly configurable:
  - Number of Processors
  - Round Robin, Pipeline or Hybrid Data Flow
  - Data Redistribution Support
- Frees the algorithm designer from the details of the data distribution

Open Architecture Standards allow for Platform Flexibility

Digital Processor Block Diagram

[Block diagram] The Radar Control Processor (RCP) sends control messages (STIM) to the Digital Signal Processor (DSP); sensor data flows into the DSP, and the DSP returns detection reports. Legend: processor, data path.

Real Time DSP Solution

DSP Infrastructure Description:
- A Flow Graph defines the Data Flow and the Algorithm Mapping to a Network of Processors
  - Based on a Static Map of DSP processors
  - The Infrastructure supports Multiple Flow Graphs
  - Text File Based (easily modified), loaded during Software Initialization
  - Easy to add algorithms or modify the data flow
- MPI Intercommunicators are formed based on Flow Graph information
  - They provide the Stimulus and Data Flow Paths; the Redistribution API uses the formed Data Paths
- The Infrastructure has been tested on Server and Embedded architectures using more than 64 processors
  - No code modification is needed; the DSP is recompiled for the particular architecture

Infrastructure is Platform Independent

Flow Graph Details

MPI Stimulus and Data Flow communication paths are formed based on information read from text Flow_Graph files during initialization.

Example DSP Flow Graph:

Processor | MODE | Purpose  | Group | Group Size | Group Count | Input      | Output
0         | RCP  | STIM     | 1     | 1          | 1           | NONE       |
1         | DSP  | FE_SUM   | 1     | 1          | 1           | NONE       | PC_SUM
2         | DSP  | PC_SUM   | 1     | 1          | 1           | FE_SUM     | NCI,CFAR
3, 4      | DSP  | NCI      | 1     | 2          | 1           | PC_SUM     | CFAR
5         | DSP  | CFAR     | 1     | 1          | 2           | PC_SUM,NCI | POST_PRO
6, 7      | DSP  | CFAR     | 2     | 2          | 2           | PC_SUM,NCI | POST_PRO
8         | DSP  | POST_PRO | 1     | 1          | 1           | CFAR       | NONE

Reconfiguration does not require code modification.

Flow Graph Communicators

[Diagram] Resulting Stim and Data communication paths for the example flow graph: the Radar Control Processor drives the SUM channel Front End over Data Path 1; Pulse Compression feeds Non Coherent Integration and, round robin over Data Paths 1 and 2, the two CFAR groups; both CFAR groups feed Post Processing over a combined Data Path. A Radar Stimulus Comm carries the control path. Legend: MPI processor, data path, control path.

Pipeline Flow Graph

Pipeline DSP Flow Graph:

Processor | MODE | Purpose  | Group | Group Size | Group Count | Input                | Output
0         | RCP  | STIM     | 1     | 1          | 1           | NONE                 |
1         | DSP  | FE_SUM   | 1     | 1          | 1           | NONE                 | PC_SUM
2         | DSP  | FE_AZ    | 1     | 1          | 1           | NONE                 | PC_AZ
3         | DSP  | FE_EL    | 1     | 1          | 1           | NONE                 | PC_EL
4         | DSP  | PC_SUM   | 1     | 1          | 1           | FE_SUM               | NCI
5         | DSP  | PC_AZ    | 1     | 1          | 1           | FE_AZ                | NCI
6         | DSP  | PC_EL    | 1     | 1          | 1           | FE_EL                | NCI
7, 8      | DSP  | NCI      | 1     | 2          | 1           | FE_SUM, FE_AZ, FE_EL | CFAR
9         | DSP  | CFAR     | 1     | 1          | 1           | NCI                  | POST_PRO
A         | DSP  | POST_PRO | 1     | 1          | 1           | CFAR                 | NONE

Pipeline Processing

[Diagram] Resulting Control and Data communication paths: the SUM, ALPHA and BETA channel Front Ends feed Pulse Compression, then NCI, CFAR and Post Processing in a pipeline spanning processors 0 through A. A single Stim communication path is formed from the Radar Control Processor. Legend: MPI processor, data path, control path.

Hybrid Flow Graph

Hybrid DSP Flow Graph:

Processor | MODE | Purpose  | Group | Group Size | Group Count | Input                | Output
0         | RCP  | STIM     | 1     | 1          | 1           | NONE                 |
1         | DSP  | FE_SUM   | 1     | 1          | 1           | NONE                 | PC_SUM
2         | DSP  | FE_AZ    | 1     | 1          | 1           | NONE                 | PC_AZ
3         | DSP  | FE_EL    | 1     | 1          | 1           | NONE                 | PC_EL
4         | DSP  | PC_SUM   | 1     | 1          | 1           | FE_SUM               | NCI
5         | DSP  | PC_AZ    | 1     | 1          | 1           | FE_AZ                | NCI
6         | DSP  | PC_EL    | 1     | 1          | 1           | FE_EL                | NCI
7         | DSP  | NCI      | 1     | 1          | 4           | FE_SUM, FE_AZ, FE_EL | CFAR
8         | DSP  | NCI      | 1     | 1          | 4           | FE_SUM, FE_AZ, FE_EL | CFAR
9         | DSP  | NCI      | 1     | 1          | 4           | FE_SUM, FE_AZ, FE_EL | CFAR
A         | DSP  | NCI      | 1     | 1          | 4           | FE_SUM, FE_AZ, FE_EL | CFAR
B         | DSP  | CFAR     | 1     | 1          | 1           | NCI                  | POST_PRO
C         | DSP  | POST_PRO | 1     | 1          | 1           | CFAR                 | NONE

Hybrid Processing

[Diagram] Resulting Stim and Data communication paths: the SUM, AZ and EL channel Front Ends feed Pulse Compression in a pipeline; four distinct communication paths (Path 1 through Path 4) fan out to the NCI processors round robin, with one path used per Radar Sequence, then converge on CFAR (processor B) and Post Processing (processor C). The Stim distributor determines which Comm path is in use, and stimulus is only distributed to processors that need it. Legend: MPI processor, data path.

Multiple Flow Graphs Pipe 1 Receiver Interface Pipe 0 Legend Processor Data Flow Pipe 2 Post Processing Flow Graph 2 Optimized for Small Sample Lengths Flow Graph 1 Optimized for Large Sample Lengths Channel 3 Pipe 3 Channel 2 Channel 1 Channel 3 Channel 2 Channel 1 Each Flow Graph is Optimized for Radar Modes or Data Size

Data Redistribution

- Data Flow paths are used for Redistribution
- Data Redistribution Calls are Layered on MPI
- Provide 2D->2D redistribution with Overlap and Modulo support
- Insulates the Algorithm from the Redistribution

Algorithm Pseudo Code Fragment:

    // Data input scalable across processors
    // Receive Input Data
    blocks = Redist( Buffer, 14, 23, 1, 0 );
    // Perform algorithm on received data
    for( int i = 0; i < blocks; i++ ) {
        vsip_ccfftop_f(...);
        ...
    }
    // Data output scalable across processors
    // Send Output Data
    blocks = Redist( Buffer, 14, 32, 1, 0 );

VSIPL provides a Platform Independent API. In the figure, data input is redistributed from 2 processors to 3 processors, and data output from 3 processors to 1 processor.

The Developer can Concentrate on the Algorithm Implementation

Data Redistribution Without Overlap

- Data Flow communication paths are used for Redistribution
- Data Redistribution Calls are Layered on MPI
- Provide 2D->2D redistribution with Overlap and Modulo support
- Insulates the Algorithm from the Redistribution

    Redist( Data_Buffer, Splitable Dimension, Unsplitable Dimension, Modulo, Overlap );
    Redist( Buffer, 14, 6, 1, 0 );  // Application Redistribution Call

[Diagram] A 14x6 Data Buffer held on 3 processors is redistributed across 5 processors with no overlap between blocks.

Data Redistribution With Overlap

    Redist( Data_Buffer, Splitable Dimension, Unsplitable Dimension, Modulo, Overlap );
    Redist( Buffer, 14, 6, 1, 1 );  // Application Redistribution Call With Overlap 1

The Same Call is Made by all 5 Processors.

[Diagram] A 14x6 Data Buffer held on 3 processors is redistributed across 5 processors; adjacent blocks share overlapped data rows.

Data Redistribution With Modulo

    Redist( Data_Buffer, Splitable Dimension, Unsplitable Dimension, Modulo, Overlap );
    Redist( Buffer, 14, 6, 4, 0 );  // Application Redistribution Call With Modulo 4

The Same Call is Made by all 5 Processors.

[Diagram] A 14x6 Data Buffer held on 3 processors is redistributed across 5 processors in blocks constrained to multiples of 4 rows, with one block carrying the remainder.

Matrix Transpose

    Ct_Transfer( Data_Buffer, Splitable Dimension, Unsplitable Dimension, Modulo );
    Ct_Transfer( Buffer, 14, 6, 1 );  // Application Matrix Transpose

The Same Call is Made by all 5 Processors.

[Diagram] A 14x6 Data Buffer held on 3 processors is corner-turn redistributed across 5 processors as a 6x14 Data Buffer.

Summary

DSP Infrastructure:
- Supports Real-Time High-Performance Embedded Radar Applications
  - Low Overhead
  - Scalable to requirements
- Built on Open Architecture Standards: MPI and VSIPL
- Reduces development costs
  - Scalable to applications with minimal changes to software
- Provides for Platform Independence
- Provides DSP Lifecycle Support
  - Scale the DSP from Development to Delivery Without Code Modifications
  - Add Algorithms with Minimal Software Changes
  - Reusable Infrastructure and Algorithms
  - Easily Scale the DSP for Various Deployments

Infrastructure Reduces Development Cost