Generic Code Optimization. Jack Dongarra, Shirley Moore, Keith Seymour, and Haihang You. LACSI Symposium, Automatic Tuning of Whole Applications Workshop.

Presentation transcript:

Generic Code Optimization
Jack Dongarra, Shirley Moore, Keith Seymour, and Haihang You
LACSI Symposium, Automatic Tuning of Whole Applications Workshop
October 11, 2005

LACSI Generic Code Optimization
- Take generic code segments and perform optimizations via experiments (similar to ATLAS)
- Collaboration with ROSE project (source-to-source code transformation/optimization) at Lawrence Livermore National Laboratory
  - Daniel Quinlan and Qing Yi

LACSI GCO
A source-to-source transformation approach to optimizing arbitrary code, especially loop nests. The framework consists of:
- Machine parameter detection
- Source-to-source code generation
- Testing driver generation
- Empirical search engine
A sketch of the tuning loop these components implement is shown below.
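To make the overall flow concrete, the following sketch shows the shape of the empirical tuning loop the components above implement. Everything in it (the candidate struct, generate_variant, build_and_time, the toy exhaustive search) is an illustrative placeholder, not the actual GCO interface; in the real system the code generator and testing driver do this work, and the search engine replaces the exhaustive loop.

#include <stdio.h>

/* Illustrative skeleton of an empirical tuning loop.  generate_variant()
 * stands in for the source-to-source code generator and build_and_time()
 * for "compile the variant, run the generated driver, read the timer". */
typedef struct { int blocking; int unroll; double time; } candidate;

static void generate_variant(const candidate *c) {
    printf("  generate variant: blocking=%d unroll=%d\n", c->blocking, c->unroll);
}

static double build_and_time(const candidate *c) {
    /* placeholder measurement so the example runs standalone */
    return 1.0 / (c->blocking * c->unroll + 1);
}

int main(void) {
    candidate best = {0, 0, 1e30};
    int blockings[] = {16, 32, 64};
    int unrolls[]   = {1, 2, 4};
    /* exhaustive search over a toy space; GCO would use simplex or GA here */
    for (int b = 0; b < 3; ++b) {
        for (int u = 0; u < 3; ++u) {
            candidate c = {blockings[b], unrolls[u], 0.0};
            generate_variant(&c);
            c.time = build_and_time(&c);
            if (c.time < best.time) best = c;   /* keep the fastest variant */
        }
    }
    printf("best: blocking=%d unroll=%d time=%g\n", best.blocking, best.unroll, best.time);
    return 0;
}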

LACSI GCO Framework
[Framework block diagram: front end, IR, code generator (CG), loop analyzer, driver generator, testing driver, and empirical search engine, connected by the flow of code and tuning-parameter information.]

LACSI Simplex Method
- The simplex method is a non-derivative direct search method for finding an optimal value.
- N+1 points of an N-dimensional search space make up a simplex.
- Basic operations: reflection, expansion, contraction, and shrink.

LACSI Basic Simplex
Basic idea of the simplex method in 2D: to find the maximum value of f(x) in a 2-dimensional search space, the simplex consists of X1, X2, X3. Suppose f(X1) <= f(X2) <= f(X3). In each step we find Xc, the centroid of X2 and X3, and replace X1 with Xr, the reflection of X1 through Xc.
[Figure: the simplex (X1, X2, X3), its centroid Xc, and the reflected point Xr.]
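A minimal sketch of this reflection step in C for the 2-D maximization case just described; the objective f, the vertex values, and the function names are illustrative placeholders, not part of the GCO search engine.

#include <stdio.h>

/* Example objective: any smooth function of two variables. */
static double f(const double x[2]) {
    return -(x[0] - 1.0) * (x[0] - 1.0) - (x[1] - 2.0) * (x[1] - 2.0);
}

/* One reflection step of the simplex method in 2-D:
 * X1 is the worst vertex (smallest f), X2 and X3 the better ones.
 * Xc is the centroid of X2 and X3; Xr = 2*Xc - X1 is the reflection
 * of X1 through Xc.  If f(Xr) > f(X1), X1 is replaced by Xr. */
static void reflect(double X1[2], const double X2[2], const double X3[2]) {
    double Xc[2], Xr[2];
    for (int d = 0; d < 2; ++d) {
        Xc[d] = 0.5 * (X2[d] + X3[d]);
        Xr[d] = 2.0 * Xc[d] - X1[d];
    }
    if (f(Xr) > f(X1)) {          /* keep the better point */
        X1[0] = Xr[0];
        X1[1] = Xr[1];
    }
}

int main(void) {
    double X1[2] = {4.0, 5.0};    /* worst vertex */
    double X2[2] = {1.5, 2.5};
    double X3[2] = {0.5, 1.5};
    reflect(X1, X2, X3);
    printf("reflected worst vertex: (%g, %g)\n", X1[0], X1[1]);
    return 0;
}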

LACSI DGEMM ATLAS Search Space
ATLAS does orthogonal searching over this space, which represents roughly 1M search points. The simplex search instead treats the parameters, each between a lower and an upper bound, as one multidimensional search space:
- NB: cache blocking
- LAT: FP unit latency
- MU, NU: register blocking
- KU: unrolling
- FFTCH: determines prefetch of matrix C into registers
- IFTCH, NFTCH: determine the interleaving of loads with computation
A sketch of how such a bounded parameter space can be represented follows.
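One way such a bounded parameter space might be represented for the search is sketched below; the struct, the numeric bounds, and the clamping helper are illustrative assumptions, not ATLAS or GCO internals.

#include <stdio.h>

/* Illustrative representation of the DGEMM tuning-parameter space.
 * The parameter names follow the slide (NB, LAT, MU, NU, KU, FFTCH,
 * IFTCH, NFTCH); the lower/upper bounds are made-up placeholders. */
typedef struct {
    const char *name;  /* tuning parameter */
    int lo, hi;        /* lower / upper bound for the search */
} tuning_param;

static const tuning_param dgemm_space[] = {
    { "NB",    16, 128 },  /* cache blocking factor              */
    { "LAT",    1,   8 },  /* FP unit latency                    */
    { "MU",     1,   8 },  /* register blocking (M direction)    */
    { "NU",     1,   8 },  /* register blocking (N direction)    */
    { "KU",     1,   8 },  /* inner-loop unrolling               */
    { "FFTCH",  0,   1 },  /* prefetch C into registers?         */
    { "IFTCH",  1,   8 },  /* interleaving of loads with compute */
    { "NFTCH",  1,   8 },
};

/* Clamp one value to its parameter's bounds before the corresponding
 * kernel variant is generated, compiled, and timed. */
static int clamp(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

int main(void) {
    int point[8] = { 200, 4, 10, 2, 4, 1, 3, 3 };  /* e.g. one simplex vertex */
    for (int p = 0; p < 8; ++p) {
        point[p] = clamp(point[p], dgemm_space[p].lo, dgemm_space[p].hi);
        printf("%-6s = %3d  (bounds %d..%d)\n",
               dgemm_space[p].name, point[p], dgemm_space[p].lo, dgemm_space[p].hi);
    }
    return 0;
}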

LACSI Comparison of performance of DGEMM generated with ATLAS and Simplex search

LACSI Comparison of performance of DGEMM on a 1000x1000 matrix generated with ATLAS and Simplex search

LACSI Comparison of parameter search time for ATLAS and Simplex search

LACSI Comparison of performance of DGEMM generated with ATLAS and Parallel GA (GridSolve)

LACSI Comparison of parameter search time for ATLAS and Parallel GA (GridSolve)

LACSI Code Generation
Collaboration with ROSE project (source-to-source code transformation/optimization) at Lawrence Livermore National Laboratory.
Example invocation: LoopProcessor -bk3 4 -unroll 4 ./dgemv.c

LACSI Testing Driver Generation
The testing driver initializes variables and collects performance data (wall-clock time or hardware counter data).

/*$ATLAS ROUTINE DGEMV */
/*$ATLAS SIZE 1000:2000:100 */
/*$ATLAS ARG M IN int $size */
/*$ATLAS ARG N IN int $size */
/*$ATLAS ARG ALPHA IN double 1.0 */
/*$ATLAS ARG A[M][N] IN double $rand */
/*$ATLAS ARG B[N] IN double $rand */
/*$ATLAS ARG C[M] INOUT double $rand */
void dgemv(int M, int N, double alpha, double *A, double *B, double *C)
{
  int i, j;
  /* matrices are stored in column major */
  for (i = 0; i < M; ++i) {
    for (j = 0; j < N; ++j) {
      C[i] += alpha * A[j*M + i] * B[j];
    }
  }
}
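From annotations like these, the driver generator emits a main program that allocates the arguments, fills them per the $size/$rand directives, calls the routine, and reports the measurement. Below is a minimal hand-written sketch of what such a driver might look like, using wall-clock time via gettimeofday; hardware-counter collection is not shown, and the structure and the MFLOPS report are assumptions for illustration, not the actual generated driver.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Routine under test, as in the annotated source above (column major). */
void dgemv(int M, int N, double alpha, double *A, double *B, double *C)
{
  int i, j;
  for (i = 0; i < M; ++i)
    for (j = 0; j < N; ++j)
      C[i] += alpha * A[j*M + i] * B[j];
}

int main(void)
{
  int M = 1000, N = 1000;                       /* per $ATLAS SIZE */
  double *A = malloc((size_t)M * N * sizeof(double));
  double *B = malloc((size_t)N * sizeof(double));
  double *C = malloc((size_t)M * sizeof(double));
  /* fill arguments with random data, per the $rand annotations */
  for (long k = 0; k < (long)M * N; ++k) A[k] = rand() / (double)RAND_MAX;
  for (int k = 0; k < N; ++k) B[k] = rand() / (double)RAND_MAX;
  for (int k = 0; k < M; ++k) C[k] = rand() / (double)RAND_MAX;

  struct timeval t0, t1;
  gettimeofday(&t0, NULL);
  dgemv(M, N, 1.0, A, B, C);                    /* ALPHA = 1.0 */
  gettimeofday(&t1, NULL);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
  /* dgemv performs roughly 2*M*N floating-point operations */
  printf("time = %.6f s, %.1f MFLOPS\n", secs, 2.0 * M * N / secs / 1e6);

  free(A); free(B); free(C);
  return 0;
}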

LACSI
int min(int, int);

/*$ATLAS ROUTINE DGEMV */
/*$ATLAS SIZE 1000:1000:1 */
/*$ATLAS ARG M IN int $size */
/*$ATLAS ARG N IN int $size */
/*$ATLAS ARG ALPHA IN double 1.0 */
/*$ATLAS ARG A[M][N] IN double $rand */
/*$ATLAS ARG B[N] IN double $rand */
/*$ATLAS ARG C[M] INOUT double $rand */
void dgemv(int M, int N, double alpha, double *A, double *B, double *C)
{
  int _var_1;
  int _var_0;
  int i;
  int j;
  for (_var_1 = 0; _var_1 <= -1 + M; _var_1 += 4) {
    for (_var_0 = 0; _var_0 <= -1 + N; _var_0 += 4) {
      for (i = _var_1; i <= min((-1 + M), (_var_1 + 3)); i += 1) {
        for (j = _var_0; j <= min((N + -4), _var_0); j += 4) {
          C[i] += (alpha * A[(j * M + i)]) * B[j];
          C[i] += (alpha * A[((1 + j) * M + i)]) * B[(1 + j)];
          C[i] += (alpha * A[((2 + j) * M + i)]) * B[(2 + j)];
          C[i] += (alpha * A[((3 + j) * M + i)]) * B[(3 + j)];
        }
        for (; j <= min((-1 + N), (_var_0 + 3)); j += 1) {
          C[i] += (alpha * A[(j * M + i)]) * B[j];
        }
      }
    }
  }
}

The untransformed dgemv (previous slide) appears alongside on the slide for comparison.

LACSI Comparison of performance of DGEMV generated with ATLAS and Simplex search with ROSE

LACSI
void dgemv(int M, int N, double alpha, double *A, double *B, double *C)
{
  int _var_1;
  int _var_0;
  int i;
  int j;
  int ub1, ub2;
  for (_var_1 = 0; _var_1 <= -1 + M; _var_1 += 113) {
    ub1 = rosemin((-8 + M), (_var_1 + ...));
    for (_var_0 = 0; _var_0 <= -1 + N; _var_0 += 113) {
      ub2 = rosemin((N + -6), (_var_0 + ...));
      for (i = _var_1; i <= ub1; i += 8) {
        for (j = _var_0; j <= ub2; j += 6) {
          register double bjp0, bjp1, bjp2, bjp3, bjp4, bjp5;
          register double cip0, cip1, cip2, cip3, cip4, cip5, cip6, cip7;
          register int t0, t1, t2, t3, t4, t5;
          t0 = j * M + i; t1 = (1 + j) * M + i; t2 = (2 + j) * M + i;
          t3 = (3 + j) * M + i; t4 = (4 + j) * M + i; t5 = (5 + j) * M + i;
          cip0 = C[i]; cip1 = C[i + 1]; cip2 = C[i + 2]; cip3 = C[i + 3];
          cip4 = C[i + 4]; cip5 = C[i + 5]; cip6 = C[i + 6]; cip7 = C[i + 7];
          bjp0 = B[j]; bjp1 = B[j+1]; bjp2 = B[j+2];
          bjp3 = B[j+3]; bjp4 = B[j+4]; bjp5 = B[j+5];
          cip0 += (A[t0]) * bjp0;
          ......
          C[i+6] = cip6; C[i+7] = cip7;
        }
        for (; j <= rosemin(N-1, _var_0+112); j++) {
          C[i] += (alpha * A[j * M + i]) * B[j];
          C[i+1] += (alpha * A[j * M + i+1]) * B[j];
          ......
          C[i+7] += (alpha * A[j * M + i+7]) * B[j];
        }
      }
      for (; i <= rosemin((M-1), (_var_1 + 112)); i++) {
        for (j = _var_0; j <= rosemin((N + -1), (_var_0 + 112)); j++) {
          C[i] += (alpha * A[(j * M + i)]) * B[j];
        }
      }
    }
  }
}