
Mapping Stream Programs onto Heterogeneous Multiprocessor Systems
[Paper by the Barcelona Supercomputing Centre, Spain, Oct 2009]
Presented by S. M. Farhad, Programming Language Group, School of Information Technology, The University of Sydney

Abstract
- Presents a partitioning and allocation algorithm for a stream compiler, targeting heterogeneous multiprocessors with constrained distributed memory and communication topology.
- Introduces a novel definition of connectedness, which enables the algorithm to model the capabilities of the compiler.
- The algorithm uses convexity and connectedness constraints to produce partitions that are easier to compile and require short pipelines.
- Shows StreamIt benchmark results for an SMP, a 2x2 mesh, an SMP plus accelerator, and an IBM QS20 blade, all within 5% of the optimum.

Motivation
- Recent trend: an increasing number of on-chip processor cores, e.g., Intel IXP2850 [16], TI OMAP [7], Nexperia Home Platform [9], and ST Nomadik.
- Programs using explicit threads have to be tuned to match the target.
- The goal is to automatically map a portable program onto a heterogeneous multiprocessor.
- Many multimedia and radio applications contain abundant task and data parallelism, but it is hard to extract from C code; stream languages instead represent the application as independent actors communicating via point-to-point streams.

Overview
- The input to the partitioning algorithm is the program graph of kernels and streams.
- The output is the mapping that fuses kernels into tasks and allocates tasks to processors.
- The difference between kernels and tasks:
  - Kernels: the work functions of the actors.
  - Tasks: present in the executable; each contains multiple kernels and is scheduled on a known processor.
- The algorithm requires:
  - the average data rate on each stream (known statically for SDF), and
  - the average load of each kernel on each processor (from profiling).
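To make the later sketches concrete, here is a minimal, hypothetical representation of the partitioner's input; the names are illustrative, not the ACOTES compiler's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class StreamProgram:
    """Partitioner input (illustrative names, not the ACOTES API)."""
    kernels: list   # kernel names (work functions of the actors)
    streams: list   # (producer, consumer) pairs
    load: dict      # (kernel, processor) -> gigacycles per period tau
    data: dict      # (producer, consumer) -> gigabytes per period tau
```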

The ACOTES Stream Compiler
- This work is part of the ACOTES European project, which is developing an open-source stream compiler for embedded systems.
- The compiler will automatically map a stream program onto a multicore system.
- Programs are written in SPM, an annotated C programming language.
- The target is represented using the ASM (Abstract Streaming Machine), which supports heterogeneous platforms.

The Mapping Phase
- Blocking and splitting: a polyhedral optimization pass unrolls to aggregate computation and communication, and splits stateless kernels for greater parallelism.
- Partitioning: the algorithm described in this paper; an initial partition is refined by merging tasks, moving bottlenecks, creating tasks, and reallocating tasks.
- Software pipelining and queue length assignment: performs software pipelining and allocates memory for the stream buffers.

Convex Connected Partitions
- Convexity: a partition is convex if the graph of dependencies between tasks is acyclic. Equivalently, every directed path between two kernels in the same task is internal to that task.
- The convexity constraint avoids long software pipelines: a partitioner unaware of pipelining cost would otherwise demand long pipelines for only a small increase in throughput.
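As a sketch of how the convexity test can be checked, assuming the kernels, streams, and a kernel-to-task map as above: build the quotient graph of tasks and verify it is acyclic with a topological sort. This is one straightforward realisation, not necessarily the paper's implementation.

```python
from collections import defaultdict

def is_convex(kernels, streams, task_of):
    """A partition is convex iff the induced graph of dependencies
    between tasks (one node per task) is acyclic."""
    succ = defaultdict(set)
    indeg = defaultdict(int)
    tasks = {task_of[k] for k in kernels}
    for src, dst in streams:
        a, b = task_of[src], task_of[dst]
        if a != b and b not in succ[a]:
            succ[a].add(b)
            indeg[b] += 1
    # Kahn's algorithm: acyclic iff a topological order covers all tasks.
    ready = [t for t in tasks if indeg[t] == 0]
    seen = 0
    while ready:
        t = ready.pop()
        seen += 1
        for u in succ[t]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return seen == len(tasks)
```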

Quick Review: Software Pipelining
[Figure: RectPolar example, showing the prologue and the new steady state]
The new steady state is free of dependencies.

[Figure: the number of pipeline stages]

Convex Partition
[Figure] Data flow is from processor p1 (black) to p2 (grey) and p3 (white), and from p2 to p3: an acyclic graph.

Connectedness
- The connectedness constraint helps code generation: it is easier to fuse adjacent kernels.
- In the example figure, if k2 and k3 are fused into one task, then the entire graph must be fused.

Connectedness Contd.
- A naïve definition considers a partition to be connected when each processor has a weakly connected subgraph.
- Unfortunately, wide split-joins, as in filterbank, do not usually have good partitions subject to this constraint: because of strict connectedness, the partition of Fig. (c) performs 28% worse than that of Fig. (a).
- The paper therefore generalizes connectivity as a set of basic connected sets [Joseph'06].
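A minimal sketch of the naïve connectedness test described above: treat streams as undirected edges and check that the kernels of each task form a single weakly connected component. The generalisation to basic connected sets [Joseph'06] replaces exactly this test.

```python
from collections import defaultdict

def is_weakly_connected(task_kernels, streams):
    """True iff the given kernels, with streams taken as undirected
    edges, form one connected component."""
    members = set(task_kernels)
    adj = defaultdict(set)
    for src, dst in streams:
        if src in members and dst in members:
            adj[src].add(dst)
            adj[dst].add(src)
    # Depth-first search from an arbitrary kernel of the task.
    start = next(iter(members))
    stack, seen = [start], {start}
    while stack:
        for n in adj[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen == members
```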

Formalization of the Problem
The target is represented as an undirected bipartite graph $H = (V, E)$, where $V = P \cup I$ is the set of vertices, a disjoint union of processors $P$ and interconnects $I$, and $E$ is the set of edges.
- Processor weight $w_p$: clock speed in GHz.
- Interconnect weight $w_u$: bandwidth in GB/s.
- Static route between processors $p$ and $q$: $r_{pqu} = 1$ if the route uses interconnect $u$, and $0$ otherwise. In general, a route may traverse more than one interconnect.

Topology of the Targets
[Figure: example target topologies] Processors $a$ and $b$ are unable to execute code; therefore $w_a = w_b = 0$.

Formalization of the Problem Contd.
- The program is represented as a directed acyclic graph $G = (K, S)$, where $K$ is the set of kernels and $S$ is the set of streams.
- The load of kernel $i$ on processor $p$, denoted $c_{ip}$, is the mean number of gigacycles in some fixed time period $\tau$.
- The load of stream $ij$, denoted $c_{ij}$, is the mean number of gigabytes transferred in time $\tau$.

Formalization of the Problem Contd.
- The output of the algorithm is a pair of maps: $T$ maps kernels onto tasks, and $P$ maps tasks onto processors.
- The partition implied by $T$ must be convex, so the graph of dependencies between tasks is acyclic.

Formalization of the Problem Contd.
The cost on processor $p$ and on interconnect $u$ is

$$C_p = \frac{1}{w_p} \sum_{i\,:\,P(T(i)) = p} c_{ip} \qquad\qquad C_u = \frac{1}{w_u} \sum_{(i,j) \in S} c_{ij}\, r_{P(T(i))P(T(j))u}$$

The goal is to find the allocation $(T, P)$ that minimises the maximum of all the $C_p$ and $C_u$, subject to the convexity and connectedness constraints.
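A sketch of this objective under the slide's definitions; `route[p, q]` is assumed to list the interconnects on the static route between p and q (those u with $r_{pqu} = 1$), and the other names follow the hypothetical `StreamProgram` above.

```python
from collections import defaultdict

def bottleneck_cost(prog, T, P, w, route):
    """Returns the max over all C_p and C_u:
    C_p = (1/w_p) * sum of c_ip over kernels whose task is on p;
    C_u = (1/w_u) * sum of c_ij over streams routed through u."""
    C = defaultdict(float)
    for k in prog.kernels:
        p = P[T[k]]
        C[p] += prog.load[k, p] / w[p]
    for i, j in prog.streams:
        p, q = P[T[i]], P[T[j]]
        if p != q:                  # streams internal to a processor are free
            for u in route[p, q]:   # interconnects with r_pqu = 1
                C[u] += prog.data[i, j] / w[u]
    return max(C.values())
```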

Partitioning the Target
The algorithm first divides the target into two subgraphs, $P_1$ and $P_2$, joined by an aggregate interconnect, $I$, balancing two objectives:
- the subgraphs should have roughly equal total CPU performance, and
- the aggregate interconnect bandwidth between them should be low.

Partitioning the Target Contd.
To limit the communications bottleneck, the target is divided into halves so as to maximise $\alpha$, a measure of the communications bottleneck between the halves. An approximate solution is found using a variant of the Kernighan-Lin partitioning algorithm.

Partitioning the Program
- The program graph is given edge and vertex weights.
- The edge weight for stream $ij$, denoted $c_{ij}$, is its cost in cycles in time $\tau$ if assigned to the aggregate interconnect.
- The vertex weight for kernel $i$ is a pair $(c_{iP_1}, c_{iP_2})$.
- The goal is to find a two-way partition $\{K_1, K_2\}$ that minimizes the bottleneck: the maximum of the compute cost of each half and the communication cost of the streams crossing the cut, $\max(C_{P_1}, C_{P_2}, C_I)$.

Partitioning the Program Contd.
- The partitioning algorithm is a branch and bound search.
- Each node in the search tree inherits a partial partition $(K_1, K_2)$ and a set of unassigned vertices $X$; at the root, $K_1 = K_2 = \emptyset$ and $X = K$.
- A lower bound $c_{K_1 K_2}$ on the minimal cost of all partitions in the subtree rooted at node $(K_1, K_2)$ is used to prune the search.
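A generic sketch of such a search: each node carries the partial partition and the unassigned kernels, and a subtree is pruned when its lower bound cannot beat the best complete partition found so far. The paper's bound and its convexity and connectedness checks are abstracted behind `lower_bound` and `cost`, both hypothetical callables.

```python
import math

def branch_and_bound(kernels, lower_bound, cost):
    """Depth-first branch and bound over two-way partitions (K1, K2)."""
    best = {"cost": math.inf, "part": None}

    def expand(K1, K2, X):
        if not X:                                   # complete partition
            c = cost(K1, K2)
            if c < best["cost"]:
                best["cost"], best["part"] = c, (K1, K2)
            return
        if lower_bound(K1, K2, X) >= best["cost"]:
            return                                  # prune this subtree
        k, rest = X[0], X[1:]
        expand(K1 | {k}, K2, rest)                  # branch: k joins K1
        expand(K1, K2 | {k}, rest)                  # branch: k joins K2

    expand(frozenset(), frozenset(), list(kernels))
    return best["part"], best["cost"]
```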

The Branch and Bound Search
[Figure: an example branch and bound search tree]

Refinement of the Partition
The refinement stage starts with a valid initial partition and applies four kinds of moves:
- merge tasks
- move bottlenecks
- create tasks
- reallocate tasks
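A sketch of the refinement driver under the assumption that each pass proposes a modified partition: a proposal is kept only if it strictly lowers the bottleneck cost. The internals of the four passes are elided; the names are illustrative.

```python
def refine(partition, passes, cost):
    """Greedy refinement: cycle through the passes (merge tasks, move
    bottlenecks, create tasks, reallocate tasks) until no pass helps."""
    best = cost(partition)
    improved = True
    while improved:
        improved = False
        for apply_pass in passes:
            candidate = apply_pass(partition)
            c = cost(candidate)
            if c < best:
                partition, best, improved = candidate, c, True
    return partition
```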

Evaluation
[Figure: evaluation results on the StreamIt benchmarks]

Convergence of the Refinement Phase
[Figure: convergence of the refinement phase]

Summary
- A fast and robust partitioning algorithm for an iterative stream compiler.
- The algorithm maps an unstructured, variable-data-rate stream program onto a heterogeneous multiprocessor system with any communications topology.
- The algorithm favours convex connected partitions, which require only short software pipelines and are easy to compile.
- The performance is, on average, within 5% of the optimum.

Questions?
Thank you.