Code Generation, PetaQCD, Orsay, January 20th-21st, 2009

CAPS Code Generation
- CAPS in PetaQCD
- HMPP in a nutshell
  - Directives for Hardware Accelerators (HWA)
- HMPP Code Generation Capabilities
  - Code generation for GPU (CUDA, …)
  - Code generation for Cell (available end of Q1 2009)
- Code Tuning
  - CAPS Tuner

Company Profile
- Founded in 2002, spin-off of two research institutes:
  - INRIA: processor microarchitecture and code generation expertise
  - UVSQ: micro-benchmarking and architecture behavior expertise
- 25 staff
- Mission: help our customers efficiently build parallel applications
  - Integration services and expertise
  - Software tools (code generation and runtime)
- Customer references: Intel, Total, CEA, EADS, Bull, …
- R&D projects: POPS (System@tic), Milepost (IST), QCDNext (ANR), PARA (ANR), …

CAPS in PetaQCD
- Provide code generation tools for Hardware Accelerators (HWA)
  - Highly optimized for LQCD
  - Based on a data-parallel intermediate form (produced from a higher-level problem description language)
  - Mix of general-purpose cores and HWA (hybrid parallelism)
- Provide iterative compilation techniques
  - Code tuning via optimization space exploration
- CAPS-specific deliverables
  - D2.2: decision/code generation process at run time, and definition of a data-parallel intermediate language to describe QCD applications
  - D2.3: dynamic techniques for code generation

Simple C Example
The codelet (to be executed on the HWA) and its call site are marked with the codelet / callsite directive pair:

#include <stdio.h>
#include <stdlib.h>

/* Codelet directive: simplefunc is generated for the HWA; argument 1 (v1) is an output */
#pragma hmpp simple codelet, args[1].io=out
void simplefunc(int n, float v1[n], float v2[n], float v3[n], float alpha)
{
  int i;
  for (i = 0; i < n; i++) {
    v1[i] = v2[i] * v3[i] + alpha;
  }
}

int main(int argc, char **argv)
{
  unsigned int n = 400;
  float t1[400], t2[400], t3[400];
  float alpha = 1.56;
  unsigned int j, seed = 2;

  /* Initialization of input data */
  /* . . . */

  /* Callsite directive: this call is dispatched to the HWA version of the codelet */
  #pragma hmpp simple callsite
  simplefunc(n, t1, t2, t3, alpha);

  printf("%f %f (...) %f %f \n", t1[0], t1[1], t1[n-2], t1[n-1]);
  return 0;
}
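For illustration only, here is a minimal sketch of the kind of CUDA kernel such a codelet could be turned into; the kernel name and launch configuration are assumptions, not the actual HMPP-generated code:

/* Hypothetical kernel: one GPU thread per vector element. */
__global__ void simplefunc_kernel(int n, float *v1, const float *v2,
                                  const float *v3, float alpha)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    v1[i] = v2[i] * v3[i] + alpha;
}

/* At the callsite, the generated host code amounts roughly to: device
   allocation, host-to-device copies of v2 and v3, a launch such as
   simplefunc_kernel<<<(n + 255) / 256, 256>>>(n, d_v1, d_v2, d_v3, alpha);
   and a device-to-host copy of v1. */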

Codelet Generation

Objectives
- Allow transparent use of HWAs
  - From C, Fortran, or Java to CUDA, Brook, …
- Allow code tuning at the source-code level
  - Directive-based approach

Code Generation Flow

Codelet Generation
- C, Java, or Fortran source code as input
  - HWA-oriented subset of the languages
- Set of directives to
  - Optimize the generation of the target codelet
  - Express parallelism
  - Make code tuning easier
- The generated code can also be tuned

Loop Parallelization
Force or prevent the parallelization of loops to help define the kernels in a codelet:

#pragma hmppcg parallel
for (i = 0; i < n; i++) {
  #pragma hmppcg noParallel
  for (j = 0; j < n; j++) {
    D[i][j] = A[i][j] * E[3][j];
  }
}
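With these directives the outer i loop is forced to be parallel (it is typically the loop that gets mapped onto the HWA threads), while the inner j loop is explicitly kept sequential within each outer iteration.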

Input C Code Example 1

typedef struct { float r, i; } Complex;

#pragma hmpp convolution2d codelet, args[data; opx].io=in, args[convr].io=out, target=CUDA
void convolution2d(Complex *data, int nx, int ny, Complex *opx, int oplx, int oply, Complex *convr)
{
  int hoplx = (oplx + 1) / 2;
  int hoply = (oply + 1) / 2;
  int iy, ix;
  #pragma hmppcg parallel
  for (iy = 0; iy < ny; iy++) {
    for (ix = 0; ix < nx; ix++) {
      float dumr = 0.0, dumi = 0.0;
      int ky;
      for (ky = -(oply - hoply - 1); ky <= hoply; ky++) {
        int kx;
        for (kx = -(oplx - hoplx - 1); kx <= hoplx; kx++) {
          /* Clamp accesses to the image borders */
          int dx = min(max(ix + kx, 0), (nx - 1));
          int dy = min(max(iy + ky, 0), (ny - 1));
          dumr += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
          dumr -= data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
        }
      }
      convr[iy * nx + ix].r = dumr;
      convr[iy * nx + ix].i = dumi;
    }
  }
}

Input Fortran Code Example 2

!$HMPP sgemm3 codelet, target=CUDA, args[vout].io=inout
SUBROUTINE sgemm(m, n, k2, alpha, vin1, vin2, beta, vout)
  INTEGER, INTENT(IN)    :: m, n, k2
  REAL,    INTENT(IN)    :: alpha, beta
  REAL,    INTENT(IN)    :: vin1(n,n), vin2(n,n)
  REAL,    INTENT(INOUT) :: vout(n,n)
  REAL                   :: prod
  INTEGER                :: i, j, k
!$HMPPCG unroll(8), jam(2), noremainder
!$HMPPCG parallel
  DO j = 1, n
!$HMPPCG unroll(8), splitted, noremainder
!$HMPPCG parallel
    DO i = 1, n
      prod = 0.0
      DO k = 1, n
        prod = prod + vin1(i,k) * vin2(k,j)
      ENDDO
      vout(i,j) = alpha * prod + beta * vout(i,j)
    END DO
  END DO
END SUBROUTINE sgemm

MxM Performance

Performance Examples

Codelet Tuning

Codelet Tuning (1)
- Based on the CAPSTuner technology
  - Platform independent
- Iterative compilation technique to explore the optimization space
  - Explore source code transformations, via a set of code transformation directives such as unroll-and-jam
  - Explore compiler options
- Store the performance data in a well-defined repository
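To make unroll-and-jam concrete, here is a minimal plain-C sketch of the kind of transformation the tuner can explore; the loop nest, array names, and unroll factor are illustrative, not HMPP output:

/* Original loop nest: a simple matrix-matrix multiply. */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

/* After unroll(2) and jam on the i loop (N assumed even for brevity):
   the two copies of the inner loops are fused, so each load of B[k][j]
   is reused for two rows of C, reducing memory traffic. */
for (i = 0; i < N; i += 2)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++) {
      C[i][j]   += A[i][k]   * B[k][j];
      C[i+1][j] += A[i+1][k] * B[k][j];
    }

Iterative compilation then times each variant (different unroll factors, jam depths, compiler options) and records the results in the performance repository.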

Code Tuning (2)