Presentation transcript:

Graph Programming Model: An Efficient Approach For Sensor Signal Processing
Steve Kirsch, Principal Fellow, Raytheon Space and Airborne Systems
9/10/2012
Copyright © 2012 Raytheon Company. All rights reserved. "Customer Success Is Our Mission" is a registered trademark of Raytheon Company.

Agenda
- Graph Programming Model (GPM) goals
- Overview of GPM basics
- GPM hierarchical program structure
- A brief look at how GPM achieves efficiency
- Conclusion and Path Forward

Key Goals Of The Graph Programming Model (GPM)
- Efficiency – High throughput without exposing processing-target details, thus maintaining a clean programming abstraction
- Portability – Require no or minimal application software modifications when target hardware is upgraded or the application is moved to a significantly different target architecture
- Scalability – Provide increased application performance, with no application modification, as target hardware capabilities increase
- Productivity – Provide a process and tools that reduce the complexity of developing a highly efficient parallel program

GPM: Basic Concept of Operation
- GPM uses Directed Acyclic Graphs (DAGs) as the basic building block
- A DAG is a collection of vertices and directed edges; each edge connects one vertex to another such that there is no way to start at a vertex and follow a sequence of edges that eventually loops back to that same vertex
- A precedence relationship is established by the data-flow pattern
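To make the DAG execution rule concrete, here is a minimal Python sketch (illustrative only; the vertex names and the use of the standard-library graphlib module are assumptions, not GPM's API). It builds a small graph of processing vertices and runs them in dependency order, the precedence relationship that the data-flow pattern induces:

    from graphlib import TopologicalSorter  # Python 3.9+ standard library

    # Each vertex is a processing step; each entry maps a vertex to the
    # vertices that must complete before it (its data producers).
    dag = {
        "pulse_compress": [],                        # no predecessors
        "doppler":        ["pulse_compress"],
        "detect":         ["pulse_compress", "doppler"],
    }

    # A valid execution order respects every edge; a cycle would raise
    # graphlib.CycleError, which is exactly what "acyclic" rules out.
    for vertex in TopologicalSorter(dag).static_order():
        print("execute", vertex)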

Agenda
- Graph Programming Model (GPM) goals
- Overview of GPM basics (this section)
- GPM hierarchical program structure
- A brief look at how GPM achieves efficiency
- Conclusion and Path Forward

GPM: Basic Concept of Operation – DAGs Used To Express Data Level Parallelism (DLP)
- Datasets represent multi-dimensional data (e.g., radar sensor fast-time samples across a row, slow-time (pulse-to-pulse) samples down a column)
- Typically data dependencies exist in one dimension (or a few dimensions), exposing data level parallelism in the other dimensions
- JobClasses follow an SPMD (Single Program Multiple Data) processing model and are designed to consume data level parallelism
- JobClass views describe the data parallelism
[Figure: an SPMD JobClass of jobs 1.0 … 1.n-1, each consuming its corresponding pieces D1.x and D2.x of Datasets 1 and 2 on an SPMD machine]
GPM explicitly exposes algorithmic DLP
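As a hedged illustration of the SPMD idea (not GPM code; NumPy and the 8-way split are assumptions), the sketch below runs one program, a per-row FFT along the dependent fast-time dimension, over slices of the data-parallel slow-time dimension:

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def job(tile):
        # The single program: an FFT along the fast-time (dependent) axis.
        return np.fft.fft(tile, axis=1)

    if __name__ == "__main__":
        # pulses (slow time) x samples (fast time); slow time carries no
        # dependency, so it is the data-parallel dimension.
        dataset = np.random.rand(1024, 4096)
        tiles = np.array_split(dataset, 8, axis=0)      # multiple data
        with ProcessPoolExecutor() as pool:             # same program per slice
            result = np.vstack(list(pool.map(job, tiles)))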

GPM: Basic Concept of Operation – DAGs Used To Express Thread Level Parallelism (TLP)
- JobClasses are grouped into a subGraph; a set of subGraphs forms an MPMD (Multiple Program Multiple Data) machine
- SubGraphs are separately allocatable processing threads
- Datasets consumed by a subGraph are allocated to memory local to where the subGraph executes
- SubGraphs can be allocated to processing resources in real time
[Figure: two subGraphs, each consuming its own chain of datasets (D0 … DN-1 and Dn+2 … DM-1)]
GPM explicitly exposes algorithmic TLP
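A minimal sketch of subGraphs as separately allocatable threads (all names hypothetical): two subGraphs run in parallel, each owning the datasets local to it, and together they form a small MPMD machine:

    import threading, queue

    def subgraph(name, local_dataset, out):
        # Independent grouping of data and processing: each subGraph runs
        # its own program over datasets kept local to where it executes.
        out.put((name, sum(local_dataset)))   # stand-in for real JobClasses

    results = queue.Queue()
    sg = [threading.Thread(target=subgraph, args=("range_proc", range(1000), results)),
          threading.Thread(target=subgraph, args=("doppler_proc", range(500), results))]
    for t in sg: t.start()                    # allocated to resources at run time
    for t in sg: t.join()
    print(results.get(), results.get())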

Data Reorganization Is Systematically Handled in GPM – JobClass Views Express Data Reorganization
- A JobClass view expresses how data is accessed by each vertex
- The view describes a virtually contiguous subset of the dataset
- A JobClass view can express any arbitrary reorganization of the data
- Typically used for transposing data between sequential processing stages
[Figure: two JobClasses (jobs 1.0 … 1.n-1 and 2.0 … 2.n) accessing the same dataset pieces D1.x, D2.x, D3.x through different views]
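The corner-turn case can be sketched without GPM at all: in the NumPy fragment below (an analogy, not the JobClass view mechanism), a transpose is a view with new strides, so a column of a row-major dataset is presented as a virtually contiguous vector with no copy:

    import numpy as np

    dataset = np.arange(12).reshape(3, 4)   # row-major: pulses x samples

    row_view = dataset[1, :]     # stage 1: a job sees one row
    col_view = dataset.T[2, :]   # stage 2: a job sees one column (corner turn)

    # The transpose shares the original memory; only the access pattern,
    # i.e. the "view", changed between the two processing stages.
    print(row_view, col_view, dataset.T.base is dataset)   # ... True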

Agenda
- Graph Programming Model (GPM) goals
- Overview of GPM basics
- GPM hierarchical program structure (this section)
- A brief look at how GPM achieves efficiency
- Conclusion and Path Forward

GPM: Hierarchical Program Structure – Graphs Are the Fundamental Building Blocks
- Graphs are used as building blocks to construct a signal processing application
- Graphs are reusable DAGs that can be parameterized (e.g., number of collected samples to process, number of pulses to process, etc.)
- SAR (Synthetic Aperture Radar) signal processing example (see the sketch after this list):
  - A single PtP (Pulse-to-Pulse) graph type is used for processing segments of a long coherent collection
  - A Batch graph integrates the PtP outputs where data dependencies exist between the pulse-to-pulse graphs
  - Multiple PtP graphs may be processed in parallel when collection time < processing time
- Graphs express the coarse-grained TLP inherent in an algorithm
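Continuing the dictionary-DAG sketch from earlier (again hypothetical names, not GPM syntax), graph reuse and parameterization might look like this: one PtP graph factory is instantiated per collection segment, and a Batch graph depends on every PtP output:

    def make_ptp_graph(seg):
        # A reusable, parameterized graph: one instance per segment of the
        # long coherent collection.
        return {
            f"ptp{seg}.pulse_compress": [],
            f"ptp{seg}.doppler":        [f"ptp{seg}.pulse_compress"],
        }

    app = {}
    for seg in range(3):              # three PtP graphs, parallel candidates
        app.update(make_ptp_graph(seg))
    # The Batch graph integrates where dependencies cross the PtP graphs.
    app["batch.integrate"] = [f"ptp{s}.doppler" for s in range(3)]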

GPM: Hierarchical Program Structure – Graphs Are Decomposed into SubGraphs
- SubGraphs represent a finer-grained expression of TLP
- Independent grouping of data and processing
- May execute in parallel, thus utilizing parallel processing resources
- Provide an opportunity to manage processing latency
- Inter-subGraph communication can be implemented with message passing, typical for a loosely coupled memory hierarchy

GPM: Hierarchical Program Structure – SubGraphs Are Decomposed into JobClasses
- A subGraph is decomposed into one or more JobClasses
- JobClasses are SPMD programs (shown on an earlier chart) that consume the DLP that exists in datasets
- A JobClass consists of one or more Jobs
- Each Job may execute in parallel on the cluster of processing nodes allocated to the subGraph
- If a JobClass contains more Jobs than processing nodes, Jobs are queued and execute sequentially
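The queueing rule has a direct analogue in a fixed-size worker pool; in this sketch the pool size stands in for the nodes allocated to the subGraph (the counts are made up):

    from concurrent.futures import ThreadPoolExecutor

    NODES, JOBS = 4, 10               # more Jobs than processing nodes

    def job(i):
        return f"job {i} done"

    # Four Jobs run in parallel; the remaining six queue and execute
    # sequentially as nodes free up.
    with ThreadPoolExecutor(max_workers=NODES) as nodes:
        for msg in nodes.map(job, range(JOBS)):
            print(msg)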

GPM: Hierarchical Program Structure – Datasets Are Decomposed into Tiles
- A dataset is divided into tiles, exposing Data Level Parallelism (DLP)
- A JobClass view defines how to carve the dataset into tiles
- Tiles represent virtually contiguous chunks of data that can be processed in parallel
- Typically hundreds of tiles comprise a dataset
- Tiles are grouped and assigned to Jobs: Jobs process their tiles in parallel, while a single Job processes its tiles sequentially
- JobClass view attributes: number of dimensions; atomic dimension; number of dimensions required to access a tile; dimension structure (length, stride)
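A sketch of how length/stride attributes locate tiles (the sizes and the row-major layout are assumptions, chosen only to show the arithmetic): for a 2-D dataset, a tile's start offset follows directly from the dimension structure:

    import itertools

    rows, cols = 512, 4096            # hypothetical 2-D dataset, row-major
    tile_rows, tile_cols = 64, 1024   # the view carves 64 x 1024 tiles
    row_stride = cols                 # elements between consecutive rows

    # Start offset of tile (r, c) = r * row_stride + c.
    tile_origins = [r * row_stride + c
                    for r, c in itertools.product(range(0, rows, tile_rows),
                                                  range(0, cols, tile_cols))]
    print(len(tile_origins), "tiles; first origins:", tile_origins[:4])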

Agenda
- Graph Programming Model (GPM) goals
- Overview of GPM basics
- GPM hierarchical program structure
- A brief look at how GPM achieves efficiency (this section)
- Conclusion and Path Forward

GPM Achieves Efficiency: By Using Data Tiling
- The ability to simultaneously move data and perform useful computation is key to high performance
- Job sequencing and data tiling provide the ability to bury data movement under computation
[Figure: graph control flow – the middleware spawns a graph, waits for inputs, and queues Jobs; for each Job (Job 0, Job 1, …) tile transfers into input buffers and out of output buffers overlap successive compute calls (Begin, Compute0 … Compute3, End) until the dataset is complete]
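Burying data movement is classic double buffering; this toy sketch (sleep calls stand in for transfer and compute work, and nothing here is GPM code) prefetches tile N+1 while tile N is being computed:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def fetch(tid):                   # stand-in for an input-buffer transfer
        time.sleep(0.01); return f"tile{tid}"

    def compute(tile):                # stand-in for the Job's compute call
        time.sleep(0.02); return tile + " done"

    tile_ids = list(range(8))
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch, tile_ids[0])         # prime the pipeline
        for i in range(len(tile_ids)):
            tile = pending.result()                     # transfer is buried
            if i + 1 < len(tile_ids):
                pending = io.submit(fetch, tile_ids[i + 1])
            print(compute(tile))                        # overlaps next fetch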

GPM Achieves Efficiency: Utilizing Heterogeneous Processing
- Heterogeneous processing is transparently integrated into the GPM
- Achieving performance per SWaP (size, weight, and power) requires the flexibility of a variety of processor types: multi-core GPs, GPGPUs, multi-core SIMD arrays, FPGAs
- The key to transparent integration is a consistent API across processor types: exploiting TLP and DLP with a consistent interface, and communicating between processor types through a consistent user abstraction
- GPM achieves this by parameterizing the processor type: the graph description simply references a processor type, allowing an implementation to allocate the processing step to the appropriate resource
- Signal processing within a JobClass utilizes a custom library for the assigned processor type (JobClass description)
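Parameterizing the processor type can be pictured as a dispatch table keyed by the type named in the graph description (everything below is a hypothetical stand-in, not GPM's interface):

    # Two implementations of the same step, one per processor type.
    def fir_cpu(x):  return [v * 0.5 for v in x]    # multi-core GP path
    def fir_gpu(x):  return [v * 0.5 for v in x]    # GPGPU path (stub)

    KERNELS = {"cpu": fir_cpu, "gpgpu": fir_gpu}

    jobclass_description = {"step": "iq_formation", "processor_type": "gpgpu"}

    # The graph only names a type; the runtime binds the step to the
    # custom library for that resource.
    kernel = KERNELS[jobclass_description["processor_type"]]
    print(kernel([1.0, 2.0, 3.0]))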

GPM Achieves Efficiency: Utilizing The Optimal Transport
- Many signal processing problems are too large for a single SWaP-constrained SMP machine and thus require a hybrid of SMP and message passing
- Partitioning a design to best utilize the optimal transport is a function of the TLP available in an algorithm
- GPM exposes TLP in a systematic way that enables a runtime implementation to select the appropriate transport without impact to the signal processing design: intra-subGraph datasets utilize SMP communication, inter-subGraph datasets utilize message passing (graph description)
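The transport rule reduces to a simple predicate, sketched here with hypothetical names: same subGraph means shared-memory (SMP) communication, different subGraphs mean message passing:

    def pick_transport(producer_subgraph, consumer_subgraph):
        # Intra-subGraph datasets stay in local shared memory; datasets
        # crossing subGraphs go over a message-passing transport.
        if producer_subgraph == consumer_subgraph:
            return "smp_shared_memory"
        return "message_passing"      # e.g., MPI between clusters

    print(pick_transport("sg0", "sg0"))   # smp_shared_memory
    print(pick_transport("sg0", "sg1"))   # message_passing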

GPM Achieves Efficiency: By a Variety of Other Mechanisms
- Coarse-scale real-time load balancing: enabled by the use of subGraphs; subGraphs are independent, so they can be allocated in real time to the next available processing "cluster" (a cluster is a group of SMP-coupled machines)
- Fine-scale real-time load balancing: data processed by a JobClass is partitioned between Jobs, which at run time are allocated to processing nodes; work can be evenly spread across all the nodes within a cluster
- Data locality control: since subGraphs are allocated to a cluster, a runtime implementation can push datasets to the memory where the data will be consumed; dataset writes execute in parallel with tile compute cycles, so the data movement can be buried
- Data alignment: efficient use of modern processors is typically sensitive to alignment of data to a fundamental hardware constraint (e.g., cache line size); GPM parameterizes this fundamental constraint, which allows data movement between JobClasses and datasets to be optimized for the target hardware
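The alignment point is just round-up arithmetic; a sketch, assuming a 64-byte cache line as the parameterized hardware constraint:

    def aligned_stride(num_bytes, alignment=64):
        # Round a buffer or tile stride up to the target's fundamental
        # alignment so data movement lands on hardware-friendly boundaries.
        return (num_bytes + alignment - 1) // alignment * alignment

    print(aligned_stride(1000))   # 1024: next 64-byte multiple
    print(aligned_stride(1024))   # 1024: already aligned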

GPM Efficiency: Measured Results
- Compute cycles and data I/O transport time can be balanced, reducing processing time versus unburied approaches
  - The current runtime implementation, measured over dozens of JobClasses, showed a mix of memory-bandwidth-limited and compute-bound JobClasses as a function of algorithmic processing step
  - 10%–75% of the I/O time or compute cycle time can typically be buried
- A GPM implementation can achieve low runtime overhead: typically 5% of total compute cycles
- Heterogeneous processing comparison: results have shown that by carving out a few processing-intense algorithms, performance per SWaP can be improved by more than 2x
  - FPGA vs. IBM Cell processor example: IQ formation (FIR), IQ calibration (cross-coupled FIRs), pulse compression (fast convolution), motion compensation (phase ramp generation and complex multiply)

Conclusions and Path Forward
- The Graph Programming Model's achievement of its goals can be directly associated with features of the architecture
  - The features of GPM that contribute to efficiency were discussed here
  - Portability, scalability, and productivity features not discussed here are briefly described in the published paper "Graph Programming Model: An Efficient Approach For Sensor Signal Processing"
- A runtime implementation of GPM has been characterized and has shown excellent results
- Path Forward
  - Add additional GPM features: data-dependent cyclic behavior within a subGraph
  - Continue to improve the current implementation and development tool suite: performance model calculator; port the runtime to multiple target platforms; addition of a GPGPU processor type; addition of a graphical editor