INSPIRE: The Insieme Parallel Intermediate Representation
Herbert Jordan, Peter Thoman, Simone Pellegrini, Klaus Kofler, and Thomas Fahringer
University of Innsbruck, PACT'13, September 2013

Programming Models
User: writes sequential C/C++, e.g.
    void main(…) {
        int sum = 0;
        for(i = 1..10) sum += i;
        print(sum);
    }
Compiler: lowers the program (PL -> IR -> Assembly), e.g.
    .START ST
    ST:  MOV R1,#2
         MOV R2,#1
    M1:  CMP R2,#20
         BGT M2
         MUL R1,R2
         INC R2
         JMP M1
and takes care of instruction selection, register allocation, optimization, and loops & latency.
HW: a single core with cache ($) and memory, i.e. essentially a 10-year-old architecture.
(A compilable C version of the running example is sketched below.)
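For reference, a minimal compilable rendering of the slide's running example; treating "1..10" as an inclusive integer range is an assumption about the pseudocode. This is exactly the kind of sequential input the classical pipeline above handles well.

    /* Minimal C version of the running example: sum of 1..10. */
    #include <stdio.h>

    int main(void) {
        int sum = 0;
        for (int i = 1; i <= 10; i++)   /* assumes "1..10" means the inclusive range */
            sum += i;
        printf("%d\n", sum);            /* prints 55 */
        return 0;
    }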

Parallel Architectures
- Multicore (cores sharing memory and caches): OpenMP / Cilk
- Accelerators (host CPU plus GPU with its own memory): OpenCL / CUDA
- Clusters (multiple nodes, each with its own memory): MPI / PGAS

Compiler Support
C / C++ input:
    void main(…) {
        int sum = 0;
        #omp pfor
        for(i = 1..10) sum += i;
    }
Frontend / Backend: only the sequential parts flow through the usual pipeline.
    IR:   .START ST
          ST:  MOV R1,#2
               MOV R2,#1
               _GOMP_PFOR
          M1:  CMP R2,#20
               BGT M2
               MUL R1,R2
               INC R2 ...
    bin:  Start:  mov eax,2
                  mov ebx,1
                  call "pfor"
          Label1: lea esi, Str
                  push esi
    lib:  pfor:   mov eax,-2
                  cmp eax, 2
                  xor eax, eax ...
The parallel loop itself becomes a call into a runtime library (_GOMP_PFOR / "pfor"), which the compiler treats as an opaque function (see the sketch below).
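To make the "call into a library" step concrete, here is a hedged sketch of the outlining a typical OpenMP compiler performs for the "#omp pfor" loop above. rt_parallel_for, loop_body and loop_env are hypothetical stand-ins, not the actual libgomp interface, and the stub runtime simply runs the whole range on one thread.

    /* Hedged sketch: lowering a parallel-for annotation into an outlined
     * function plus a call to an opaque runtime routine. */
    #include <stdio.h>

    typedef void (*loop_body_fn)(int begin, int end, void *env);

    /* hypothetical stand-in runtime: a real one would split [lo,hi) across threads */
    static void rt_parallel_for(int lo, int hi, loop_body_fn body, void *env) {
        body(lo, hi, env);
    }

    struct loop_env { int *sum; };      /* variables captured from the enclosing scope */

    /* outlined loop body, as a compiler would generate it */
    static void loop_body(int begin, int end, void *env) {
        struct loop_env *e = env;
        for (int i = begin; i < end; i++)
            *e->sum += i;               /* a real lowering would privatize/reduce sum */
    }

    int main(void) {
        int sum = 0;
        struct loop_env env = { &sum };
        rt_parallel_for(1, 11, loop_body, &env);   /* replaces the annotated loop */
        printf("%d\n", sum);                       /* prints 55 */
        return 0;
    }

From the compiler's point of view, loop_body and the runtime call are ordinary code and an ordinary external function: the parallel semantics live entirely in the library, which is exactly the "magic happens in libraries" problem described on the next slide.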

Situation
- Compilers
  - unaware of thread-level parallelism
  - magic happens in libraries
- Libraries
  - limited perspective / scope
  - no static analysis, no transformations
- User?
  - has to coordinate a complex system
  - no performance portability

Real World Architectures: real systems combine all of the above; each node couples several cores, caches, and a GPU with its own memory, and many such nodes are connected, from embedded and desktop systems upward.

Real World Architectures: on such a machine all three programming-model domains overlap, OpenMP/Cilk within a node, OpenCL/CUDA for the accelerators, and MPI/PGAS across nodes.

Approaches
- Option A: pick one API and settle
  - OpenMP / OpenCL / MPI
  - no full resource utilization
  - still not trivial
- Option B: hybrid approach
  - combine APIs (the C/C++ code targets the architecture through SIMD, pthreads, OpenMP, CUDA, OpenCL, MPI side by side)
  - multiple levels of parallelism
  - the user has to coordinate the APIs (see the sketch below)
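A hedged sketch of what Option B looks like from the user's side: two APIs (MPI across processes, OpenMP within each process) that the programmer, not the compiler, has to keep consistent. The cyclic distribution of the 1..10 sum across ranks is an illustrative assumption.

    /* Hybrid MPI + OpenMP version of the running example; build e.g. with an
     * MPI compiler wrapper and OpenMP enabled (mpicc -fopenmp). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each rank takes every size-th element of 1..10, threads share that chunk */
        int local = 0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 1 + rank; i <= 10; i += size)
            local += i;

        int sum = 0;
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %d\n", sum);   /* 55 */

        MPI_Finalize();
        return 0;
    }

Neither the MPI library nor the OpenMP runtime knows about the other, so load balancing, data placement, and correctness across the two levels remain manual, which is the coordination burden the slide refers to.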

Situation (cont'd)
- Compilers
  - unaware of thread-level parallelism
  - magic happens in libraries
- Libraries
  - limited perspective / scope
  - no static analysis, no transformations
- User
  - has to manage and coordinate parallelism
  - no performance portability

Compiler Support?
User: still writes the same sequential C/C++ sum example.
Compiler: still PL -> IR -> Assembly, covering instruction selection, register allocation, optimization, loops & latency, vectorization.
HW: now clusters of multicore nodes with attached GPUs.

Our approach: Insieme
User: writes C/C++ with parallel annotations, e.g.
    void main(…) {
        int sum = 0;
        #omp pfor
        for(i = 1..10) sum += i;
    }
Insieme (PL -> PL + extras): coordinates parallelism, high-level optimization, auto tuning, instrumentation; internally the program is represented in INSPIRE, e.g.
    unit main(...) {
        ref v1 = 0;
        pfor(..., (){ ... });
    }
Compiler (PL -> IR -> Assembly): instruction selection, register allocation, optimization, loops & latency, vectorization.
HW: clusters of multicore nodes with attached GPUs.

The Insieme Project
- Goal: establish a research platform for hybrid, thread-level parallelism
- Compiler: frontend for C/C++ with OpenMP, Cilk, OpenCL, MPI and extensions; INSPIRE with a static optimizer and an IR toolbox; backend; IRSM
- Runtime system: dynamic optimizer, scheduler, monitoring, execution engine
- Compiler and runtime are connected via events

Parallel Programming
- OpenMP: pragmas (+ API)
- Cilk: keywords
- MPI: library
- OpenCL: library + JIT
Objective: combine these using a unified formalism and provide an infrastructure for analysis and manipulation. (The different surface forms are contrasted in the sketch below.)
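A hedged illustration of those different surface forms, using the deck's a[i] = b[i] copy loop: OpenMP annotates an ordinary loop with a pragma, Cilk replaces the loop keyword, while MPI and OpenCL only ever appear as library calls. The cilk_for variant assumes a Cilk Plus-capable compiler (e.g. GCC/ICC with -fcilkplus); the function names are illustrative.

    /* Same loop, different surface forms (illustrative only). */
    #include <cilk/cilk.h>

    #define N 10

    void copy_openmp(int a[N], const int b[N]) {
        #pragma omp parallel for           /* OpenMP: a directive on a regular loop */
        for (int i = 0; i < N; i++)
            a[i] = b[i];
    }

    void copy_cilk(int a[N], const int b[N]) {
        cilk_for (int i = 0; i < N; i++)   /* Cilk: a keyword replaces 'for' */
            a[i] = b[i];
    }

    /* MPI and OpenCL, by contrast, are plain library interfaces (MPI_Send,
     * clEnqueueNDRangeKernel, ...) that the compiler cannot look into. */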

INSPIRE Requirements
- input: OpenMP / Cilk / OpenCL / MPI / others
- output: OpenCL / MPI / Insieme Runtime / others
- INSPIRE itself has to be: complete, unified, explicit, analyzable, transformable, compact, high level, whole program, an open system, extensible

INSPIRE
- Functional basis
  - first-class functions and closures
  - generic (function) types
  - program = 1 expression
- Imperative constructs
  - loops, conditions, mutable state
- Explicit parallel constructs
  - to model parallel control flow

Parallel Model

Parallel Model (2)

Simple Example
OpenMP:
    #pragma omp parallel for
    for(int i = 0; i < 10; i++) {
        a[i] = b[i];
    }
INSPIRE:
    merge(parallel(job([1-inf])
        () => (ref [10]> v30, ref [10]> v31) -> unit {
            pfor(0, 10, 1,
                (v23, v24, v25) => (ref [10]> v18, ref [10]> v19, int v20, int v21, int v22) -> unit {
                    for(int v12 = v20 .. v21 : v22) {
                        v18[v12] := v19[v12];
                    };
                } (v30, v31, v23, v24, v25));
            barrier();
        } (v1, v10)));
In the second build of the slide, barrier() is unfolded into its definition as a special case of the more general redistribute operator:
    redistribute(unit, (unit[] data) -> unit { return unit; });

Overall Structure
- structural, as opposed to nominal, systems
- the full program is a single expression

Overall Structure
- Consequences:
  - every entity is self-contained
  - no global lookup tables or translation units
  - local modifications are isolated
- Also:
  - the IR models execution, not code structure
  - a path in the tree corresponds to a call context (!)

Evaluation
- Question: what inherent impact does the detour through INSPIRE impose?
- Setup: the same C input code is compiled
  - directly with GCC (-O3) into Binary A, and
  - through the Insieme compiler (frontend -> INSPIRE -> backend, with no optimizations applied) into C target code on top of the Insieme Runtime (IRT), which is then compiled with GCC (-O3) into Binary B.

Performance Impact

Derived Work (subset)
- Adaptive Task Granularity Control: P. Thoman, H. Jordan, T. Fahringer, "Adaptive Granularity Control in Task Parallel Programs Using Multiversioning", Euro-Par 2013
- Declarative Transformation System: H. Jordan, P. Thoman, T. Fahringer, "A High-Level IR Transformation System", PROPER 2013
- Multi-Objective Auto-Tuning: H. Jordan, P. Thoman, J. J. Durillo et al., "A Multi-Objective Auto-Tuning Framework for Parallel Codes", SC 2012
- Compiler-Aided Loop Scheduling: P. Thoman, H. Jordan, S. Pellegrini et al., "Automatic OpenMP Loop Scheduling: A Combined Compiler and Runtime Approach", IWOMP 2012
- OpenCL Kernel Partitioning: K. Kofler, I. Grasso, B. Cosenza, T. Fahringer, "An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning", ICS 2013
- Improved Usage of MPI Primitives: S. Pellegrini, T. Hoefler, T. Fahringer, "On the Effects of CPU Caches on MPI Point-to-Point Communications", Cluster 2012

Conclusion
- INSPIRE is designed to
  - represent and unify parallel applications
  - analyze and manipulate parallel codes
  - provide a foundation for researching parallel language extensions
- it is based on a comprehensive parallel model, sufficient to cover the leading standards for parallel programming
- its practicality has been demonstrated by a variety of derived work

Thank You! Visit: Contact:

Types: 7 type constructors

Expressions: 8 kinds of expressions

Statements: 9 kinds of statements