Exploiting Execution Order and Parallelism from Processing Flow Applying Pipeline-based Programming Method on Manycore Accelerators Shinichi Yamagiwa, University of Tsukuba


Exploiting Execution Order and Parallelism from Processing Flow Applying Pipeline-based Programming Method on Manycore Accelerators Shinichi Yamagiwa University of Tsukuba Japan

Table of contents
1. Research backgrounds: flow-model based programming; graphical programming on accelerators using flow-models
2. Finding an execution order
3. Parallelism Extraction Algorithm
4. Performance evaluation using manycore accelerators
5. Conclusions

Background – programming on manycore accelerators. The programmer must write programs for both the CPU and the GPU. The accelerator is attached to the CPU's peripheral bus (PCI Express). The CPU executes the controlling program and downloads the kernel program to the accelerator; the kernel program is then executed and the CPU reads back the results. We need a plan for mapping/unmapping the kernel programs to the accelerator in a suitable order.
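The host-side control flow described above (download the kernel, execute it, read back the results) can be sketched as follows. This is a hypothetical illustration: the `Accelerator` class, `download_kernel`, and `execute` are made-up names standing in for a real accelerator API such as CUDA or OpenCL, not part of the platform described in this talk.

```python
class Accelerator:
    """Toy stand-in for a device attached to the peripheral bus."""
    def __init__(self):
        self.kernel = None
        self.log = []  # records the order of host-device operations

    def download_kernel(self, kernel):
        # CPU downloads the kernel program to the accelerator
        self.kernel = kernel
        self.log.append(("download", kernel.__name__))

    def execute(self, data):
        # kernel program is executed on the device
        self.log.append(("execute", self.kernel.__name__))
        return self.kernel(data)

def double(xs):
    """A trivial example 'kernel'."""
    return [2 * x for x in xs]

acc = Accelerator()
acc.download_kernel(double)        # 1. download the kernel
result = acc.execute([1, 2, 3])    # 2. run it on the input stream
acc.log.append(("read", result))   # 3. CPU reads the results back
```

The point of the sketch is the fixed ordering of the three steps; managing that ordering across many kernels is exactly the mapping/unmapping plan the slide calls for.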

Flow-model based programming: the Caravela platform. The programmer designs a flow-model, embedding a kernel program written in DirectX, GLSL, CUDA, or OpenCL; the flow-model is stored in an XML file. A CPU program using the Caravela Library maps the flow-model to an accelerator and executes it. Advantages: the programmer focuses on designing the flow-model; flow-models are treated like libraries for stream computing; execution timing is automatically optimized.

Graphical programming on manycore accelerators. Can we assign manycore accelerators to flow-models and find the execution flow automatically? Can we obtain an optimized pipeline execution with concurrent execution?

Exploiting the execution order and parallelism from a pipeline flow. Explicit parallelism: intuitively, these flow-models are executed in parallel, so we assign multiple flow-models to the available accelerators. Implicit parallelism: intuitively we know the execution order, so we can assign an accelerator to each flow-model one by one; in addition, two flow-models can be executed in parallel when their buffers are used independently.

How can we exploit an execution order and the parallelism? Execution ordering: how do we decide the execution order — do we need loop detection? Elimination of buffer collisions: how do we know which flow-models are concurrently executable when we consider a continuous pipeline execution?

Research objective. Graphical programming using flow-models needs (1) finding a deterministic execution order and (2) extracting parallelism, both implicit and explicit, so that an automatic pipeline order is defined for optimized pipeline execution. We propose two algorithms: (1) finding a deterministic execution order and (2) the Parallelism Extraction Algorithm.

Strategy. Finding a deterministic execution order: finding the first execution flow-model. Parallelism Extraction Algorithm: 1. finding an execution order; 2. extracting the implicit parallelism; 3. extracting the explicit parallelism. Basic execution condition: when all inputs are ready, the flow-model can be executed.
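The basic execution condition can be sketched as a small wavefront scheduler: a flow-model fires only when all of its input buffers hold data, and models that fire in the same pass could run in parallel. This is a minimal illustration of the condition, with made-up graph and buffer names, not the algorithm from the talk.

```python
def schedule(flow_models, deps, ready_inputs):
    """flow_models: list of model names in some fixed order.
    deps: model name -> set of input buffers it needs.
    ready_inputs: buffers that already hold data.
    Returns the firing order; models fired in one pass are independent."""
    order = []
    available = set(ready_inputs)
    pending = list(flow_models)
    while pending:
        # a model is executable once all of its inputs are available
        fired = [f for f in pending if deps[f] <= available]
        if not fired:
            break  # the rest wait on inputs that will never arrive
        for f in fired:
            order.append(f)
            available.add(f)  # f's output buffer becomes available
            pending.remove(f)
    return order

# Diamond-shaped example: B and C both depend only on A, so they
# fire in the same pass (implicit parallelism); D waits for both.
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
```

Here `schedule(["A", "B", "C", "D"], deps, [])` fires A first, then B and C together, then D.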

Finding the first execution flow-model (Yamagiwa and Sousa, IJPEDS, 2008, World Scientific Pub.) [Step 1] Enumerate all cyclic paths from all nodes. [Step 2] Sort the cyclic paths by the number of nodes included in each path. [Step 3] Reduce the cyclic paths to the minimum set.
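Steps 1 and 2 — enumerating the cyclic paths and sorting them by node count — can be sketched with a DFS over a toy adjacency-dict graph. This is an illustration of the idea, not the published IJPEDS 2008 algorithm; the minimum-set reduction of step 3 is omitted.

```python
def simple_cycles(graph):
    """Enumerate simple cycles in a directed graph given as
    {node: [successors]}. Returns cycles as node tuples,
    sorted by the number of nodes (step 2)."""
    cycles = set()

    def dfs(start, node, path, seen):
        for nxt in graph.get(node, []):
            if nxt == start:
                # found a cycle back to the start node; canonicalise
                # its rotation so each cycle is recorded only once
                i = path.index(min(path))
                cycles.add(tuple(path[i:] + path[:i]))
            elif nxt not in seen:
                dfs(start, nxt, path + [nxt], seen | {nxt})

    for v in graph:                  # step 1: search from every node
        dfs(v, v, [v], {v})
    return sorted(cycles, key=len)   # step 2: sort by node count
```

For example, `simple_cycles({"A": ["B"], "B": ["A", "C"], "C": []})` finds the single feedback cycle between A and B.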

Parallelism Extraction Algorithm (PEA): 1. Define the execution order by grouping three flow-models and the sub-graphs. 2. Number the groups 0, 1, and 2. 3. List the flow-models with the same number in the execution list. 4. Recursively repeat the operations above on the sub-graphs.
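A toy version of the numbering idea: assign each flow-model a group number (here simply its pipeline depth modulo 3) and list models sharing a number together as candidates for parallel execution. This is a deliberate simplification for illustration — the real PEA groups sub-graphs recursively, which this sketch does not do.

```python
def number_groups(deps):
    """deps: model -> set of predecessor models.
    Returns model -> group number in {0, 1, 2}."""
    depth = {}

    def d(m):
        if m not in depth:
            # depth 0 for sources, otherwise one past the deepest input
            depth[m] = 0 if not deps[m] else 1 + max(d(p) for p in deps[m])
        return depth[m]

    return {m: d(m) % 3 for m in deps}

def execution_list(deps):
    """Collect flow-models that share a group number."""
    groups = number_groups(deps)
    out = {}
    for m, g in groups.items():
        out.setdefault(g, []).append(m)
    return {g: sorted(ms) for g, ms in out.items()}

# B and C both sit at depth 1, so they land in the same group
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B"}}
```

With this graph, `execution_list(deps)` groups A alone at 0, B and C together at 1, and D at 2.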

Grouping three flow-models and the sub-graphs: grouping sub-graphs of one or more flow-models, and organizing the graph into three sub-graphs.

Numbering the groups 0, 1, and 2: the sub-graphs are numbered 0, 1, and 2 starting from the first executable flow-model.

Listing the flow-models with the same number in the execution list: parallelism extraction from the three sub-graphs.

Recursively repeating the previous operations (example graph with flow-models A through E).

Implementation of the Parallelism Extraction Algorithm. We introduce: the Execute matrix, where ordering information is saved per column and parallel flows (flow-models with the same number) are saved per row; the Serialize array, which marks the serialized pattern at every recursive iteration; and the Batch matrix, where the pipeline execution is saved.
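The three bookkeeping structures can be sketched with plain Python lists. The shapes and orientation below are assumptions for illustration (each step of the ordering shown as one group of parallel flow-models), not the exact layout used in the implementation.

```python
# Execute matrix: one group per execution step; flow-models within a
# group share a number and may run in parallel.
execute_matrix = [
    ["A"],        # step 0: first executable flow-model
    ["B", "C"],   # step 1: B and C use independent buffers
    ["D"],        # step 2
]

# Serialize array: the serialized pattern, flattening the matrix
# into one deterministic execution order.
serialize_array = [m for step in execute_matrix for m in step]

# Batch matrix: the pipeline execution schedule; here we simply
# replay the steps for each iteration of a continuous pipeline.
N_ITERATIONS = 2
batch_matrix = [execute_matrix[i % len(execute_matrix)]
                for i in range(N_ITERATIONS * len(execute_matrix))]
```

The maximum parallelism of the schedule is then just the size of the largest group, `max(len(step) for step in execute_matrix)`.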

Example: straight flow. Maximum parallelism is 3.

Example: flow with feedbacks Maximum parallelism is 2.

Performance Evaluation: image filtering with a 2D FFT, a high/low pass filter, and a 2D IFFT. 13 flow-models are included in the pipeline; after IFFT2, the results are generated. Using PEA, we determine the execution flow, extract the parallelism, and execute on CarSh.

CarSh: a command-line interface for manycore accelerators (Yamagiwa and Zhang, ICCS 2013). A CarSh batch describes the processing flow: flow-models are executed with exec/batch commands (e.g. execA, execB, execC), repetition is expressed with repeat (e.g. repeat 3 for three iterations), and background execution with synchronization is written as execA & execB & sync, followed by execC.
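The background-and-sync semantics sketched above (launch execA and execB in the background, wait at sync, then run execC) resemble shell job control. A rough analogy using Python threads — the names execA/execB/execC are the placeholder flow-models from the slide, not a real CarSh API:

```python
import threading

log = []
lock = threading.Lock()

def exec_model(name):
    """Stand-in for executing one flow-model on an accelerator."""
    with lock:
        log.append(name)

# execA & execB &  -- launched as background jobs
jobs = [threading.Thread(target=exec_model, args=(n,))
        for n in ("execA", "execB")]
for j in jobs:
    j.start()

# sync  -- block until all background jobs have finished
for j in jobs:
    j.join()

# execC  -- runs only after the sync barrier
exec_model("execC")
```

Whatever order execA and execB complete in, execC is guaranteed to run last, which is the property the sync command provides.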

Applying PEA to the image filtering: the maximum parallelism is 7.

Performance results: OpenCL on CPU and GPU. We measured the average time per stage at every IFFT2, and the speedup with versus without parallelization: 4.9 times faster in the CPU case and 1.4 times faster in the GPU case.

Conclusions and future direction. Graphical programming for manycore accelerators with flow-model based programming needs: finding an execution flow, and extracting the parallelism in the pipeline flow. The Parallelism Extraction Algorithm numbers the flow-models 0, 1, and 2. We are now implementing it on the GUI…

Eclipse plug-in for the Caravela platform and the CarSh environment.