Orchestrating the Execution of Stream Programs on Multicore Platforms
Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan

Cores are the New Gates
[Chart: # cores/chip rising over time (Shekhar Borkar, Intel); courtesy Gordon'06]
Languages and frameworks converging on stream programming: C/C++/Java, CUDA, X10, PeakStream, Fortress, Accelerator, Ct, CTM, RStream, RapidMind

Stream Programming
Programming style
– Embedded domain: audio/video (H.264), wireless (WCDMA)
– Mainstream: continuous query processing (IBM System S), search (Google Sawzall)
Stream: a collection of data records
Kernels/Filters
– Functions applied to streams
– Inputs and outputs are streams
– Coarse-grain dataflow
– Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03]

Compiling Stream Programs
[Figure: a stream graph mapped onto a multicore system with Core 1–Core 4 and shared memory]
The compiler does the heavy lifting: equal work distribution, communication, and synchronization.

Stream Graph Modulo Scheduling (SGMS)
Coarse-grain software pipelining
– Equal work distribution
– Communication/computation overlap
Target: Cell processor
– Cores with disjoint address spaces
– Explicit copy to access remote data; DMA engine independent of the PEs
Analogy: filters = operations, cores = function units
[Figure: Cell block diagram – PPE (PowerPC), DRAM, EIB, and SPE0–SPE7, each SPU with a 256 KB local store and an MFC (DMA engine)]
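Because each SPE can only compute on data in its 256 KB local store, remote data must be pulled in with explicit DMA, and the independent MFC is what makes communication/computation overlap possible. A minimal double-buffering sketch using the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all are real SDK calls; CHUNK, ea_in, and filter_work are illustrative assumptions):

  #include <stdint.h>
  #include <spu_mfcio.h>   /* Cell SDK MFC (DMA) intrinsics */

  #define CHUNK 4096
  volatile char buf[2][CHUNK] __attribute__((aligned(128)));

  extern void filter_work(volatile char *data, int n);  /* hypothetical filter body */

  /* Process n_chunks of remote data at effective address ea_in,
     double-buffering so the DMA of chunk i+1 overlaps work on chunk i. */
  void consume(uint64_t ea_in, int n_chunks)
  {
      int cur = 0;
      mfc_get(buf[cur], ea_in, CHUNK, cur, 0, 0);        /* prologue: fetch chunk 0 */
      for (int i = 0; i < n_chunks; i++) {
          int nxt = cur ^ 1;
          if (i + 1 < n_chunks)                          /* issue next fetch early */
              mfc_get(buf[nxt], ea_in + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
          mfc_write_tag_mask(1 << cur);                  /* wait only on the current tag */
          mfc_read_tag_status_all();
          filter_work(buf[cur], CHUNK);                  /* compute while the next DMA flies */
          cur = nxt;
      }
  }

SGMS automates exactly this pattern at the granularity of whole filters and pipeline stages.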

Streamroller Software Overview
Front ends: StreamIt, SPEX (C++ with stream extensions), stylized C
Streamroller pipeline: Stream IR → profiling → analysis → scheduling → code generation
Targets: custom application engine, SODA (low-power multicore processor for SDR), multicore (Cell, Core2Quad, Niagara)
Focus of this talk: the multicore target.

Preliminaries
Synchronous Data Flow (SDF) [Lee '87]; StreamIt [Thies '02]
Filters push and pop items from input/output FIFOs. A stateless FIR filter in StreamIt:

  int->int filter FIR(int N, int wgts[N]) {
    work pop 1 push 1 {
      int i, sum = 0;
      for (i = 0; i < N; i++)
        sum += peek(i) * wgts[i];
      push(sum);
      pop();
    }
  }

The filter becomes stateful if it carries data across firings, e.g. weights updated on each invocation:

  int wgts[N];
  ...
  wgts = adapt(wgts);
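The property that makes static scheduling possible (standard SDF background, not spelled out on the slide): because pop/push rates are compile-time constants, every filter f has a steady-state repetition count r_f satisfying the balance equation for each edge u → v:

  r_u \cdot \mathrm{push}(u) = r_v \cdot \mathrm{pop}(v)

For example, the pop 1/push 1 FIR above, feeding a filter that pops 2 per firing, must fire twice per downstream firing; one such steady-state iteration of the whole graph is the unit SGMS pipelines.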

SGMS Overview
[Figure: a stream graph partitioned across PE0–PE3; iterations T1…T4 are overlapped in a software pipeline with DMA stages in between, giving a prologue, a steady state with speedup ≈ 4, and an epilogue]

SGMS Phases
– Fission + processor assignment (goal: load balance)
– Stage assignment (goals: causality, DMA overlap)
– Code generation
First up: fission + processor assignment.

Processor Assignment
Assign filters to processors. Goal: equal work distribution. Graph partitioning? Bin packing?
[Figure: original stream graph A → (B, C) → D mapped onto PE0/PE1; the best split leaves 40 of the 60 work units on one PE, so speedup = 60/40 = 1.5]
[Figure: modified graph with B fissed into B1 and B2 plus split (S) and join (J) nodes; the two PEs are now nearly balanced, so speedup = 60/32 ≈ 2]
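The slide names the heuristic alternatives without spelling them out. As a point of comparison, a minimal greedy bin-packing sketch (longest-processing-time first) for filter-to-PE assignment; the work[] estimates are hypothetical numbers chosen to match the slide's 60-unit example:

  #include <stdio.h>

  #define NFILTERS 4
  #define NPES 2

  /* Greedy LPT packing: repeatedly place the heaviest unassigned
     filter on the currently least-loaded PE. */
  int main(void) {
      int work[NFILTERS] = {5, 40, 5, 10};   /* hypothetical work for A, B, C, D */
      int assigned[NFILTERS] = {0};
      int load[NPES] = {0};
      for (int n = 0; n < NFILTERS; n++) {
          int f = -1;
          for (int k = 0; k < NFILTERS; k++)   /* heaviest remaining filter */
              if (!assigned[k] && (f < 0 || work[k] > work[f])) f = k;
          int pe = 0;
          for (int p = 1; p < NPES; p++)       /* least-loaded PE */
              if (load[p] < load[pe]) pe = p;
          assigned[f] = 1;
          load[pe] += work[f];
          printf("filter %d -> PE %d\n", f, pe);
      }
      printf("max load: %d of 60 total\n", load[0] > load[1] ? load[0] : load[1]);
      return 0;
  }

On this example the greedy packing reproduces the 40/20 split (speedup 1.5): no assignment alone can get past the 40-unit filter B, which is why fission is integrated into the formulation below.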

Filter Fission Choices
[Figure: candidate ways of fissing filters across PE0–PE3 – which choice achieves speedup ≈ 4?]

Integrated Fission + PE Assignment
Exact solution based on Integer Linear Programming (ILP), with split/join overhead factored in.
Objective: minimize the maximal load on any PE.
Result
– Number of times to "split" each filter
– Filter → processor mapping
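The slides never show the objective in symbols. A simplified sketch of the load-balance core in my own notation, ignoring the fission-choice variables detailed on the backup slide near the end:

  \min \; \Theta
  \text{s.t.} \quad \sum_{n} w_n \, x_{n,i} \le \Theta \quad \forall i \in \{1,\dots,P\}
  \qquad \sum_{i=1}^{P} x_{n,i} = 1 \quad \forall n
  \qquad x_{n,i} \in \{0,1\}

where w_n is filter n's estimated work and x_{n,i} = 1 iff filter n runs on PE i; minimizing the bound Θ on every PE's load minimizes the maximal load.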

SGMS Phases
– Fission + processor assignment (load balance)
– Stage assignment (causality, DMA overlap)
– Code generation
Next: stage assignment.

Forming the Software Pipeline
To achieve speedup:
– All chunks should execute concurrently
– Communication should be overlapped
Processor assignment alone is insufficient information.
[Figure: filters A and B assigned to PE0 and PE1. Running A_i and B_i in the same time step violates the A → B dependence (marked X); skewing iterations so B_i runs alongside A_{i+1} is legal, and giving the A → B transfer its own DMA slot lets A_{i+2} overlap with B_i]

Stage Assignment
Two rules:
– Preserve causality (producer–consumer dependence): for an edge i → j with both filters on the same PE, S_j ≥ S_i.
– Communication–computation overlap: for an edge i → j crossing PEs, the DMA gets its own stage, with S_DMA > S_i and S_j = S_DMA + 1.
Assign stages in a dataflow traversal of the stream graph using these two rules.
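A minimal sketch of that traversal, assuming the edge list is supplied in topological order and pe[] holds the phase-1 assignment (the graph representation and names are hypothetical; a real implementation would also materialize explicit DMA nodes rather than just reserving a stage for them):

  #include <string.h>

  #define NFILTERS 7
  #define MAXE 16

  int src[MAXE], dst[MAXE], nedges;  /* edges, in topological order of src */
  int pe[NFILTERS];                  /* PE assignment from phase 1 */
  int stage[NFILTERS];               /* output: pipeline stage per filter */

  void assign_stages(void) {
      memset(stage, 0, sizeof stage);
      for (int e = 0; e < nedges; e++) {
          int i = src[e], j = dst[e];
          if (pe[i] == pe[j]) {
              /* same PE: causality only, S_j >= S_i */
              if (stage[j] < stage[i]) stage[j] = stage[i];
          } else {
              /* cross-PE: DMA takes stage S_i + 1 (S_DMA > S_i), so the
                 consumer starts no earlier than S_i + 2 = S_DMA + 1 */
              if (stage[j] < stage[i] + 2) stage[j] = stage[i] + 2;
          }
      }
  }

With multiple producers, stage[j] takes the maximum over its incoming edges, and each cross-PE edge's DMA lands at stage[j] - 1; on the example that follows this reproduces stages 0 through 4.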

Stage Assignment Example
[Figure: the fissed graph from the processor-assignment example. Stage 0: A, S, B1 on PE0; stage 1: DMA (AtoC, StoB2, B1toJ); stage 2: C, B2, J on PE1; stage 3: DMA (CtoD, JtoD); stage 4: D on PE0]

SGMS Phases
– Fission + processor assignment (load balance)
– Stage assignment (causality, DMA overlap)
– Code generation
Next: code generation.

Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
– PS3: up to 6 SPEs; QS20: up to 16 SPEs
One thread per SPE
Challenge: making a collection of independent threads implement a software pipeline – adapt the kernel-only code schema of a modulo schedule.

Complete Example
Kernel-only code for SPE1, which runs A, S, B1 at stage 0 and D at stage 4 (the stage-mask shifting is omitted here; see the SPE code template at the end):

  void spe1_work() {
    char stage[5] = {0};
    int i;
    stage[0] = 1;
    for (i = 0; i < MAX; i++) {
      if (stage[0]) { A(); S(); B1(); }
      if (stage[1]) { /* nothing for SPE1; AtoC, StoB2, B1toJ DMAs in flight */ }
      if (stage[2]) { JtoD(); CtoD(); }
      if (stage[3]) { /* nothing for SPE1; DMAs toward D in flight */ }
      if (stage[4]) { D(); }
      barrier();
    }
  }

[Figure: timeline of SPE1 / DMA1 / SPE2 / DMA2. Iteration by iteration the stage bits fill the pipeline: A, S, B1 on SPE1; AtoC, StoB2, B1toJ DMAs; C, B2, J on SPE2; CtoD, JtoD DMAs; finally D on SPE1, after which all stages run concurrently in steady state]

Experiments
StreamIt benchmarks
– Signal processing: DCT, FFT, FMRadio, radar
– MPEG2 decoder subset
– Encryption: DES
– Parallel sort: bitonic
– Range of 30 to 90 filters per benchmark
Platform: QS20 blade server – 2 Cell chips, 16 SPEs
Software: StreamIt to C, gcc 4.1, IBM Cell SDK 2.1

Evaluation on QS20
[Figure: speedup per benchmark as the number of SPEs grows]
Split/join overhead reduces the benefit from fission.
Barrier synchronization costs one barrier per iteration of the stream graph.

SGMS (ILP) vs. Greedy (MIT method, ASPLOS'06)
[Figure: speedup comparison across benchmarks]
ILP solver time < 30 seconds for 16 processors.

Conclusions
Streamroller
– Efficient mapping of stream programs to multicore
– Coarse-grain software pipelining
Performance summary
– 14.7x speedup on 16 cores
– Up to 35% better than the greedy solution (11% on average)
Scheduling framework
– Trades off memory space vs. load balance
– Applies to memory-constrained (embedded) systems and cache-based systems

Naïve Unfolding
[Figure: a stream graph A–F replicated across PE1–PE4. With a completely stateless program the four copies are independent; with stateful nodes, state-data dependences chain each node's copies across PEs, serializing them. DMA edges carry the stream data between copies]

SGMS (ILP) vs. Greedy (MIT method, ASPLOS'06)
[Figure: speedup comparison with DMA overlapped vs. exposed]
Solver time < 30 seconds for 16 processors; highly dependent on graph structure.
DMA is overlapped in both the ILP and greedy schedules; speedup drops when DMA is exposed.

Unfolding vs. SGMS
[Figure: speedup comparison of naïve unfolding against SGMS]

Pros and Cons of Unfolding
Pros
– Simple
– Good speedups for mostly stateless programs
Cons
– All input data must be available before an iteration can begin
– Long latency for one iteration; not suitable for real-time scheduling
– Sync + DMA required between rounds
– Speedup limited by filters with large state
[Figure: graph A → (B, C) → D unfolded across PE1–PE3 with a sync + DMA step between rounds]

SGMS Assumptions
Data-independent filter work functions → static schedule
High-bandwidth interconnect
– EIB: 96 bytes/cycle
– Low-overhead DMA, low observed latency
Cores can run MIMD efficiently – not (directly) suitable for GPUs

Speedup Upper Bounds
[Figure: per-benchmark speedup upper bounds]
Example: with 15% of the work in one stateless filter, max speedup = 100/15 ≈ 6.7.

Summary
14.7x speedup on 16 cores
– Preprocessing steps (fission) necessary
– Hiding communication important
Extensions: scaling to more cores, constrained memory systems
Other targets: SODA, FPGA, GPU

Stream Programming
Algorithmic style suitable for
– Audio/video processing
– Signal processing
– Networking (packet processing, encryption/decryption)
– Wireless domain, software-defined radio
Characteristics
– Independent filters: segregated address spaces, independent threads of control
– Explicit communication
– Coarse-grain dataflow
– Amenable to aggressive compiler optimizations [ASPLOS'02, '06, PLDI '03, '08]
[Figure: MPEG decoder stream graph – bitstream parser, motion decode, inverse quant AC/DC, zigzag unorder, saturate, 1D-DCT (row), 1D-DCT (column), bounded saturate]

Effect of Exposed DMA
[Figure: speedup when DMA latency is exposed rather than overlapped]

Scheduling via Unfolding Evaluation
IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 blade server
[Table: StreamIt benchmark characteristics – total filters, stateful filters, peeking filters, and state size in bytes for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar; the numeric values were not preserved in this transcript]

Absolute Performance
[Table: GFLOPS with 16 SPEs for channel, dct, fft, filterbank, fmradio, tde, vocoder, radar; bitonic, des, and mpeg2 are N/A. The numeric values were not preserved in this transcript]

University of Michigan Electrical Engineering and Computer Science 37 0_0 a 1,0,0,i – b 1,0 = 0 Σ i=1 P 1_01_1 1_2 1_3 Σ i=1 P Σ j=0 3 a 1,1,j,i – b 1,1 ≤ Mb 1,1 Σ i=1 P Σ j=0 3 a 1,1,j,i – b 1,1 – 2 ≥ -M + Mb 1,1 2_02_12_2 2_3 2_4 Σ i=1 P Σ j=0 4 a 1,2,j,i – b 1,2 ≤ Mb 1,2 Σ i=1 P Σ j=0 4 a 1,2,j,i – b 1,2 – 3 ≥ -M + Mb 1,2 b 1,0 + b 1,1 + b 1,2 = 1 Original actor Fissed 2x Fissed 3x

SPE Code Template
Generalizing the complete example: a bit mask controls which stages are active, and every SPE runs the same loop structure over all input items.

  void spe_work() {
    char stage[N] = {0};              /* bit mask to control active stages */
    int i, j;
    stage[0] = 1;                     /* activate stage 0 */
    for (i = 0; i < max_iter + N - 1; i++) {
      if (stage[N-1]) { ... }
      if (stage[N-2]) { ... }
      ...
      if (stage[0]) {
        Start_DMA();                  /* start DMA operation */
        FilterA_work();               /* call filter work function */
      }
      if (i == max_iter - 1)
        stage[0] = 0;                 /* no new iterations: start draining */
      for (j = N-1; j >= 1; j--)
        stage[j] = stage[j-1];        /* left-shift mask (activate more stages) */
      wait_for_dma();                 /* poll for completion of outstanding DMAs */
      barrier();                      /* barrier synchronization across SPEs */
    }
  }

The loop runs max_iter + N − 1 times: the first N − 1 iterations fill the pipeline (prologue) and the last N − 1 drain it (epilogue), exactly as in kernel-only modulo-scheduled code.