Garuda-PM July 2011 Challenges and Opportunities in Heterogeneous Multi-core Era R. Govindarajan, HPC Lab, SERC, IISc


Garuda-PM July 2011 © 2 Overview
Introduction
Trends in Processor Architecture
Programming Challenges
Data and Task Level Parallelism on Multi-cores
Exploiting Data, Task and Pipeline Parallelism
Data Parallelism on GPUs and Task Parallelism on CPUs
Conclusions

Garuda-PM July 2011 © 3 Moore's Law: Transistors

Garuda-PM July 2011 © 4 Moore's Law: Performance
Processor performance doubles every 1.5 years

Garuda-PM July 2011 © 5 Overview
Introduction
Trends in Processor Architecture
– Instruction Level Parallelism
– Multi-Cores
– Accelerators
Programming Challenges
Data and Task Level Parallelism on Multi-cores
Exploiting Data, Task and Pipeline Parallelism
Data Parallelism on GPUs and Task Parallelism on CPUs
Conclusions

Garuda-PM July 2011 © 6 Progress in Processor Architecture
More transistors
New architecture innovations
– Multiple Instruction Issue processors: VLIW, Superscalar, EPIC
– More on-chip caches, multiple levels of cache hierarchy, speculative execution, …
Era of Instruction Level Parallelism

Garuda-PM July 2011 © 7 Moore's Law: Processor Architecture Roadmap (Pre-2000)
[Figure: timeline from the first microprocessor through RISC, Superscalar, VLIW, and EPIC designs: the era of ILP processors]

Garuda-PM July 2011 © 8 Multicores: The Right Turn
[Figure: instead of scaling a single core from 1 GHz to 3 GHz to 6 GHz, performance growth comes from 3 GHz chips with 2, 4, and 16 cores]

Garuda-PM July 2011 © 9 Era of Multicores (Post 2000)
Multiple cores in a single die
Early efforts utilized multiple cores for multiple programs
Throughput-oriented rather than speedup-oriented!

Garuda-PM July 2011 © 10 Moore's Law: Processor Architecture Roadmap (Post-2000)
[Figure: the pre-2000 roadmap (first microprocessor, RISC, VLIW, Superscalar, EPIC) extended with multi-cores]

Garuda-PM July 2011 © 11 Progress in Processor Architecture
More transistors
New architecture innovations
– Multiple Instruction Issue processors
– More on-chip caches
– Multi-cores
– Heterogeneous cores and accelerators: Graphics Processing Units (GPUs), Cell BE, ClearSpeed, Larrabee or SCC, reconfigurable accelerators, …
Era of Heterogeneous Accelerators

Garuda-PM July 2011 © 12 Moore's Law: Processor Architecture Roadmap (Post-2000)
[Figure: the roadmap extended further with multi-cores and accelerators]

Garuda-PM July 2011 © 13 Accelerators

Garuda-PM July 2011 © 14 Accelerators: Hype or Reality? Some Top500 Systems (June 2011 List)
Rank  System              Description                           # Procs.   R_max (TFLOPS)
2     Tianhe              Xeon + Nvidia Tesla C2050 GPUs                   2,566
4     Nebulae (Dawning)   Intel X5650, Nvidia Tesla C2050 GPU              1,271
7     Tsubame             Xeon + Nvidia GPU                     73,278     1,192
10    Roadrunner          Opteron + Cell BE                                1,105

Garuda-PM July 2011 © 15 Accelerators – Cell BE

Garuda-PM July 2011 © 16 Accelerator – Fermi S2050

Garuda-PM July 2011 © 17 How is the GPU connected?
[Figure: block diagram of a multi-core node: cores C0–C3 sharing a last-level cache (SLC) and an integrated memory controller (IMC) to memory, with the GPU attached to the node]

Garuda-PM July 2011 © 18 Overview
Introduction
Trends in Processor Architecture
Programming Challenges
Data and Task Level Parallelism on Multi-cores
Exploiting Data, Task and Pipeline Parallelism
Data Parallelism on GPUs and Task Parallelism on CPUs
Conclusions

Garuda-PM July 2011 © 19 Handling the Multi-Core Challenge
Shared and Distributed Memory Programming Languages
– OpenMP
– MPI
Other Parallel Languages (partitioned global address space languages)
– X10, UPC, Chapel, …
Emergence of Programming Languages for GPUs
– CUDA
– OpenCL
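
Below is a minimal sketch (not taken from the talk) contrasting the shared-memory and GPU models on the same element-wise scaling: an OpenMP loop for the multi-core CPU and a CUDA kernel for the GPU. The function names and the 256-thread block size are illustrative choices.

#include <cstdio>
#include <cuda_runtime.h>

// Multi-core CPU version: OpenMP splits the loop iterations across cores.
void scale_omp(float *x, int n, float a) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    x[i] *= a;
}

// GPU version: one CUDA thread per element, launched over a grid of blocks.
__global__ void scale_kernel(float *x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i] *= a;
}

int main() {
  const int n = 1 << 20;
  float *h = new float[n];
  for (int i = 0; i < n; ++i) h[i] = 1.0f;

  scale_omp(h, n, 2.0f);                                // runs on the CPU cores

  float *d;
  cudaMalloc(&d, n * sizeof(float));
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
  scale_kernel<<<(n + 255) / 256, 256>>>(d, n, 2.0f);   // runs on the GPU
  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);

  printf("x[0] = %f\n", h[0]);
  delete[] h;
  return 0;
}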

Garuda-PM July 2011 © 20 GPU Programming: Good News
Emergence of Programming Languages for GPUs
– CUDA
– OpenCL (an open standard)
Growing collection of code
– CUDA Zone
– Packages supporting GPUs from ISVs
Impressive performance
– Yes! But what about Programmer Productivity?

Garuda-PM July 2011 © 21 GPU Programming: Boon or Bane?
Challenges in GPU programming
– Managing parallelism across SMs and SPMD cores
– Transfer of data between CPU and GPU
– Managing CPU-GPU memory bandwidth efficiently
– Efficient use of the different levels of memory (GPU memory, shared memory, constant and texture memory, …)
– Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
– Identifying an appropriate execution configuration for efficient execution
– Synchronization across multiple SMs
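
To make two of these burdens concrete, here is a small hand-written CUDA sketch (not from the talk) of a tiled matrix transpose: the TILE size is an execution-configuration choice, and staging through shared memory is what keeps the global-memory reads and writes coalesced. Note also that __syncthreads() only synchronizes threads within one block; there is no equivalent across SMs.

#define TILE 32   // block size: an execution-configuration choice left to the programmer

// Tiled matrix transpose: each thread block transposes one TILE x TILE tile.
// Staging the tile through shared memory lets both the global-memory read and
// the global-memory write touch consecutive addresses (coalesced accesses);
// a direct transpose would make one of the two strided.
__global__ void transpose(const float *in, float *out, int width, int height) {
  __shared__ float tile[TILE][TILE + 1];     // +1 column pads away bank conflicts

  int x = blockIdx.x * TILE + threadIdx.x;
  int y = blockIdx.y * TILE + threadIdx.y;
  if (x < width && y < height)
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

  __syncthreads();                           // synchronizes this block only, not across SMs

  x = blockIdx.y * TILE + threadIdx.x;       // transposed tile coordinates
  y = blockIdx.x * TILE + threadIdx.y;
  if (x < height && y < width)
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}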

Garuda-PM July 2011 © 22 The Challenge: Data Parallelism
[Figure: data-parallel GPU programming interfaces: Cg, CUDA, OpenCL, AMD CAL]

Garuda-PM July 2011 © 23 Challenge – Level 2: Synergistic Execution on CPU & GPU Cores
[Figure: a node with cores C0–C3, memory controller (MC), shared last-level cache (SLC), and memory, with an attached GPU]

Garuda-PM July 2011 © 24 Challenge – Level 3: Synergistic Execution on Multiple Nodes
[Figure: several such nodes, each with its own memory and NIC, connected through a network switch]

Garuda-PM July 2011 © 25 Data, Thread, & Task Level Parallelism
[Figure: the mix of programming interfaces involved: MPI, CUDA, OpenCL, OpenMP, SSE, CAL, PTX, …]

Garuda-PM July 2011 © 26 Overview
Introduction
Trends in Processor Architecture
Programming Challenges
Data and Task Level Parallelism on Multi-cores
– CUDA for Clusters
Exploiting Data, Task and Pipeline Parallelism
Data Parallelism on GPUs and Task Parallelism on CPUs
Conclusions

Garuda-PM July 2011 © 27 Synergistic Execution on Multiple Heterogeneous Cores: Our Approach
[Figure: a compiler/runtime system maps programs written in streaming languages, MPI, OpenMP, CUDA/OpenCL, parallel languages, and array languages (MATLAB) onto multicores, GPUs, Cell BE, SSE, and other accelerators]

Garuda-PM July 2011 © CUDA on Other Platforms?
CUDA programs are data parallel
Clusters of multi-cores are typically used for task-parallel applications
CUDA for multi-core SMPs (M-CUDA)
CUDA on a cluster of multi-core nodes?

Garuda-PM July 2011 © 29 CUDA on Clusters: Motivation
Motivation
– CUDA is a natural way to write data-parallel programs
– Large code base (CUDA Zone)
– Emergence of OpenCL
– How can we leverage these on (existing) clusters without GPUs?
Challenges
– Mismatch in parallelism granularity
– Synchronization and context switches can be more expensive!

Garuda-PM July 2011 © CUDA on Multi-Cores (M-CUDA)
Fuses the threads of a thread block into a single thread
OpenMP parallel for executes thread blocks in parallel
Proper synchronization across kernels
Loop transformation techniques for efficient code
Shared memory and synchronization are inherent!

Garuda-PM July 2011 © M-CUDA Code Translation
A kernel launch kernel<<<grid, block>>>(params) is translated into a loop over thread blocks, run in parallel with OpenMP:
#pragma omp parallel for
foreach block { perBlockCode(params, blockIdx, blockDim, gridDim); }
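
The slide's pseudo-code can be fleshed out as follows. This is a hand-written sketch in the spirit of M-CUDA, not the tool's actual output: the threads of each block are fused into a serial loop, and the blocks themselves run in parallel under OpenMP. A real translation must also handle __syncthreads() (via loop fission) and per-block shared memory, both of which this sketch omits.

#include <omp.h>

struct Dim3 { int x, y, z; };   // host-side stand-in for CUDA's dim3

// The original kernel body, rewritten as an ordinary function. The threads of
// one block are "fused" into the explicit loop over t below.
static void saxpy_block(float a, const float *x, float *y, int n,
                        Dim3 blockDim, Dim3 blockIdx) {
  for (int t = 0; t < blockDim.x; ++t) {          // fused thread loop
    int i = blockIdx.x * blockDim.x + t;          // was blockIdx*blockDim + threadIdx
    if (i < n) y[i] = a * x[i] + y[i];
  }
}

// Replacement for saxpy<<<grid, block>>>(a, x, y, n): thread blocks are
// independent, so OpenMP runs them in parallel across the CPU cores.
void saxpy_launch(float a, const float *x, float *y, int n,
                  Dim3 gridDim, Dim3 blockDim) {
  #pragma omp parallel for
  for (int b = 0; b < gridDim.x; ++b) {
    Dim3 blockIdx = { b, 0, 0 };
    saxpy_block(a, x, y, n, blockDim, blockIdx);
  }
}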

Garuda-PM July 2011 © CUDA-For-Clusters
CUDA kernels naturally express parallelism
– Block-level and thread-level
Multiple levels of parallelism in hardware too!
– Across multiple nodes
– And across multiple cores within a node
A runtime system performs this work partitioning at multiple granularities
Little communication between blocks (in data-parallel programs)
Need some mechanism for
– Global shared memory across multiple nodes!
– Synchronization across nodes
A 'lightweight' software distributed shared memory layer provides the necessary abstraction

Garuda-PM July 2011 © CUDA-For-Clusters
Consistency only at kernel boundaries
– Relaxed consistency model for global data
– Lazy update
CUDA memory management functions are linked to the DSM functions
Runtime system ensures partitioning of tasks across cores (OpenMP and MPI)
[Figure: the CUDA program is source-to-source transformed into a C program with block-level code specification, executed by an MPI + OpenMP work-distribution runtime over a software distributed shared memory; thread blocks B0–B7 are distributed across nodes N1–N4, each with processors P1, P2 and local RAM, connected by an interconnect]
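
As a rough illustration of the runtime's job (a simplified sketch, not the actual CUDA-For-Clusters implementation), the blocks of one kernel grid could be partitioned across MPI ranks and then across cores with OpenMP; the software DSM that keeps global buffers consistent at kernel boundaries is reduced here to a barrier.

#include <mpi.h>

// Run one translated CUDA kernel on the cluster (assumes MPI_Init was called).
// The grid's thread blocks are split across MPI ranks, then across the cores
// of each node with OpenMP. perBlockCode is the serialized block body produced
// by the source-to-source translator.
void run_kernel_on_cluster(int numBlocks, void (*perBlockCode)(int blockIdx)) {
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int per   = (numBlocks + size - 1) / size;      // contiguous slice of blocks per node
  int first = rank * per;
  int last  = (first + per < numBlocks) ? first + per : numBlocks;

  #pragma omp parallel for schedule(dynamic)
  for (int b = first; b < last; ++b)
    perBlockCode(b);          // blocks of a data-parallel kernel are independent

  // Kernel boundary: this is where the software DSM would make each node's
  // writes to global buffers visible to the others (lazy update).
  MPI_Barrier(MPI_COMM_WORLD);
}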

Garuda-PM July 2011 © Experimental Setup
Benchmarks from the Parboil and Nvidia CUDA SDK suites
8-node Xeon cluster, each node having dual-socket quad-core processors (2 x 4 cores)
Baseline: M-CUDA on a single node (8 cores)

Garuda-PM July 2011 © Experimental Results
[Figure: experimental results with lazy update]

Garuda-PM July 2011 © 36 Overview
Introduction
Trends in Processor Architecture
Programming Challenges
Data and Task Level Parallelism on Multi-cores
Exploiting Data, Task and Pipeline Parallelism
– Stream Programs on GPU
Data Parallelism on GPUs and Task Parallelism on CPUs
Conclusions

Garuda-PM July 2011 © 37 Stream Programming Model
Higher-level programming model where nodes represent computation and channels represent communication (producer/consumer relation) between them
Exposes pipelined parallelism and task-level parallelism
Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
Mapping applications to accelerators such as GPUs and Cell BE
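
To make the model concrete, here is a tiny C++ sketch (illustrative, not StreamIt code) of two filters connected by a FIFO channel. The fixed pop/push rates per firing are what Synchronous Data Flow exploits: because the rates are static, the compiler can solve for a steady-state schedule (here, fire the source twice per firing of the averager) and then software-pipeline it.

#include <cstdio>
#include <deque>

using Channel = std::deque<float>;   // a FIFO channel between two filters

// Filter 1: source, pushes 1 item per firing.
void source(Channel &out, float &state) { out.push_back(state += 1.0f); }

// Filter 2: moving average, pops 2 items and pushes 1 per firing.
void average(Channel &in, Channel &out) {
  float a = in.front(); in.pop_front();
  float b = in.front(); in.pop_front();
  out.push_back(0.5f * (a + b));
}

int main() {
  Channel c1, c2;
  float state = 0.0f;
  // Steady-state schedule: fire the source twice per firing of the averager,
  // so production (2 x 1) matches consumption (1 x 2) on channel c1.
  for (int i = 0; i < 4; ++i) {
    source(c1, state);
    source(c1, state);
    average(c1, c2);
  }
  for (float v : c2) printf("%f\n", v);
  return 0;
}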

Garuda-PM July 2011 © 38 StreamIt Example Program: 2-Band Equalizer
[Figure: stream graph of a 2-band equalizer: a signal source feeds a duplicate splitter; each branch is a bandpass filter + amplifier; a combiner joins the two branches]

Garuda-PM July 2011 © 39 Challenges on GPUs
Work distribution between the multiprocessors
– GPUs have hundreds of processors (SMs and SIMD units)!
Exploiting task-level and data-level parallelism
– Scheduling across the multiprocessors
– Multiple concurrent threads in an SM to exploit DLP
Determining the execution configuration (number of threads for each filter) that minimizes execution time
Lack of synchronization mechanisms between the multiprocessors of the GPU
Managing CPU-GPU memory bandwidth efficiently

Garuda-PM July 2011 © 40 Stream Graph Execution
[Figure: a stream graph with filters A, B, C, D and its software-pipelined execution across SM1–SM4, illustrating pipeline parallelism, task parallelism, and data parallelism]

Garuda-PM July 2011 © 41 Our Approach
Multithreading
– Identify a good execution configuration to exploit the right amount of data parallelism
Memory
– Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
Task partition between GPU and CPU cores
Work scheduling and processor (SM) assignment problem
– Takes into account communication bandwidth restrictions
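
One of the points above, the coalesced buffer layout, is easy to show in code. In this hand-written CUDA sketch (an illustration of the idea, not the paper's exact scheme) each thread runs one instance of a 4-tap FIR filter, and element j of every instance is stored contiguously, so adjacent threads always touch adjacent global addresses.

// Each thread runs one instance of a 4-tap FIR filter. Element j of every
// instance is stored contiguously, i.e. at buf[j * numInstances + instance],
// so at each step of the loop adjacent threads access adjacent addresses and
// the loads and stores to GPU memory coalesce.
__global__ void fir_filter(const float *in, float *out, const float *taps,
                           int numInstances, int numOut) {
  int inst = blockIdx.x * blockDim.x + threadIdx.x;
  if (inst >= numInstances) return;

  for (int j = 0; j < numOut; ++j) {
    float acc = 0.0f;
    for (int k = 0; k < 4; ++k)                            // 4-tap FIR
      acc += taps[k] * in[(j + k) * numInstances + inst];  // coalesced read
    out[j * numInstances + inst] = acc;                    // coalesced write
  }
}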

Garuda-PM July 2011 © 42 Compiler Framework
[Figure: the StreamIt program flows through the following stages: generate code for profiling, execute profile runs, configuration selection, task partitioning (ILP partitioner or heuristic partitioner), instance partitioning, modulo scheduling, and code generation, producing CUDA code + C code]

Garuda-PM July 2011 © 43 Experimental Results on Tesla
[Figure: chart of results, with annotated speedups of > 52x, > 32x, and > 65x]

Garuda-PM July 2011 © 44 Overview
Introduction
Trends in Processor Architecture
Programming Challenges
Data and Task Level Parallelism on Multi-cores
Exploiting Data, Task and Pipeline Parallelism
Data Parallelism on GPUs and Task Parallelism on CPUs
– MATLAB on CPU/GPU
Conclusions

Garuda-PM July 2011 © 45 Compiling MATLAB to Heterogeneous Machines
MATLAB is an array language extensively used for scientific computation
Expresses data parallelism
– Well suited for acceleration on GPUs
Current solutions (Jacket, GPUmat) require user annotations to identify GPU-friendly regions
Our compiler, MEGHA (MATLAB Execution on GPU-based Heterogeneous Architectures), is fully automatic

Garuda-PM July 2011 © 46 Compiler Overview
Frontend constructs an SSA intermediate representation (IR) from the input MATLAB code
Type inference is performed on the SSA IR
– Needed because MATLAB is dynamically typed
Backend identifies GPU-friendly kernels, decides where to run them, and inserts the required data transfers
[Figure: frontend stages (code simplification, SSA construction, type inference) followed by backend stages (kernel identification, mapping and scheduling, data transfer insertion, code generation)]

Garuda-PM July 2011 © 47 Backend: Kernel Identification
Kernel identification identifies sets of IR statements (kernels) on which mapping and scheduling decisions are made
Modeled as a graph clustering problem
Takes into account several costs and benefits while forming kernels
– Register utilization (cost)
– Memory utilization (cost)
– Intermediate array elimination (benefit)
– Kernel invocation overhead reduction (benefit)
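
The benefits listed above are easiest to see on an element-wise expression. For a MATLAB statement like d = (a + b) .* c, clustering both operations into one kernel might produce code along these lines (a hand-written illustration, not MEGHA's actual output):

// Fused kernel for the MATLAB expression d = (a + b) .* c. Clustering both
// element-wise operations into one kernel removes the temporary array that
// (a + b) would otherwise be written to, and pays the kernel-launch overhead
// once instead of twice.
__global__ void fused_add_mul(const float *a, const float *b,
                              const float *c, float *d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float t = a[i] + b[i];   // intermediate value stays in a register
    d[i] = t * c[i];
  }
}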

Garuda-PM July 2011 © 48 Backend: Scheduling and Transfer Insertion
Assignment and scheduling are performed using a heuristic algorithm
– Assigns each kernel to the processor that minimizes its completion time
– Uses a variant of list scheduling
Data transfers for dependencies within a basic block are inserted during scheduling
Transfers for inter-basic-block dependencies are inserted using a combination of data-flow analysis and edge splitting
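
A stripped-down version of such a completion-time-greedy assignment is sketched below (hypothetical code for illustration; the real scheduler also models the individual data transfers it inserts rather than a single per-kernel transfer-cost estimate).

#include <vector>

struct Kernel {
  double cpuTime, gpuTime;   // estimated execution time on each processor
  double transferCost;       // estimated cost of moving its inputs to the GPU
};

enum class Proc { CPU, GPU };

// Greedy list scheduling: visit kernels in an already topologically sorted
// order and place each one on the processor that finishes it earliest.
std::vector<Proc> schedule(const std::vector<Kernel> &kernels) {
  std::vector<Proc> placement;
  double cpuReady = 0.0, gpuReady = 0.0;   // time at which each processor is next free
  for (const Kernel &k : kernels) {
    double cpuDone = cpuReady + k.cpuTime;
    double gpuDone = gpuReady + k.transferCost + k.gpuTime;
    if (cpuDone <= gpuDone) {
      placement.push_back(Proc::CPU);
      cpuReady = cpuDone;
    } else {
      placement.push_back(Proc::GPU);
      gpuReady = gpuDone;
    }
  }
  return placement;
}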

Garuda-PM July 2011 © 49 Results
Benchmarks were run on a machine with an Intel Xeon 2.83 GHz quad-core processor and a Tesla S1070
Generated code was compiled using nvcc version 2.3; MATLAB version 7.9 was used

Garuda-PM July 2011 © 50 Overview
Introduction
Trends in Processor Architecture
Programming Challenges
Data and Task Level Parallelism on Multi-cores
Exploiting Data, Task and Pipeline Parallelism
Data Parallelism on GPUs and Task Parallelism on CPUs
Conclusions

Garuda-PM July 2011 © 51 Summary
Multi-cores and heterogeneous accelerators present tremendous research opportunities in
– Architecture
– High Performance Computing
– Compilers
– Programming Languages & Models
The challenge is to exploit Performance while keeping programmer Productivity high and ensuring Portability of code across platforms
– PPP is important

Garuda-PM July 2011 Thank You !!