1
A Language for the Compact Representation of Multiple Program Versions
Sébastien Donadio (1,2), James Brodman (3), Thomas Roeder (4), Kamen Yotov (4), Denis Barthou (2), Albert Cohen (5), María Jesús Garzarán (3), David Padua (3), and Keshav Pingali (4)
(1) BULL S.A.; (2) University of Versailles; (3) University of Illinois at Urbana-Champaign; (4) Cornell University; (5) INRIA Futurs
International Workshop LCPC 2005
2
Outline
- Context: optimization for high performance
- Goals of this language
- Features of this language
- Examples (Daxpy and Dgemm)
- Conclusion
3
Context
- Complex architectures and fragile optimizations
- Unpredictable performance
- Architecture- and domain-specific optimizations
- Resort to empirical search
- Complement general-purpose optimizations with user-driven ones
4
Example: FFT performance
[Figure: FFT performance of reasonable implementations (Numerical Recipes, GNU Scientific Library) compared with the best available implementations (FFTW, Intel IPP, Spiral)]
5
Goals of X-Language
A tool to help programmers generate and evaluate multiple versions of their programs:
- Applying control and data structure transformations
- Trying multiple transformation sequences and parameters
- Evaluating the performance of each version and deciding which transformation variants to try
6
Goals of X-Language (cont.)
The code must be portable across ISO C compilers:
- Use #pragma annotations for the above tasks
- Observable program semantics are not altered by the interpretation of these pragmas (assuming the transformations are legal)
7
Comparison with related work
[Figure: related approaches (Spiral, ATLAS, Tick C, reflection, compilers, XLG, X-Language) placed along two axes: how transformations are generated (black box vs. manual) and scope (domain-specific vs. general-purpose)]
8
Features of the language
- Elementary transformations (fission, strip-mining, interchange, unrolling, …)
- Composition of transformations
- Conditional transformations (versioning)
- Procedural abstraction of transformations
- A mechanism to define new transformations
- No validity check is performed on transformations
9
General schema of X-Language
[Figure: code with pragmas and transformation descriptions yield different versions; each version is compiled, executed, and its performance measured; a search step uses the measurements to steer which versions are generated]
10
X-Language
Naming loops or scopes:
  #pragma xlang name loop1
  for (i = 0; i < 10; i++) { a[i] = 4; }
Format of a transformation:
  #pragma xlang stripmine loop1 4 ii
The fields are, in order: the #pragma xlang prefix, the transformation name (stripmine), the name of the loop it applies to (loop1), its parameters (4), and the name of the additional loop generated by the transformation (ii).
11
Elementary transformations implemented in X-Language
- Full unrolling
- Partial unrolling
- Scalar promotion
- Interchange
- Loop fission
- Loop fusion
- Strip-mining
- Lifting
- Software pipelining
12
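Applying a transformation
Original code:
  #pragma xlang name loop1
  for (i = min; i < 4*max; i++)
    a[i] = b[i];
  #pragma xlang stripmine loop1 4 ii
Generated code:
  #pragma xlang name loop1
  for (i = min; i < 4*max; i += 4) {
    int nl1;
    #pragma xlang name ii
    for (nl1 = 0; nl1 < 4; nl1++)
      a[i+nl1] = b[i+nl1];
  }
The generated code assumes that the trip count 4*max - min is a multiple of 4; otherwise a remainder loop is required.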
13
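How to search for parameter values?
Using multistage evaluation, driven by an external script:
  for (k = 1; k < 16; k = 2*k)
    `{
      #pragma xlang name loop1
      for (i = min; i < max; i++)
        a[i] = b[i];
      #pragma xlang stripmine loop1 `d(k) ii
    `}
The quoted code between `{ and `} is generated once per value of k (1, 2, 4, 8), with `d(k) standing for the current value of k.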
14
Composing transformations
Original code:
  #pragma xlang name loop1
  for (i = 0; i < 4; i++)
    #pragma xlang name loop2
    for (j = min2; j < max2; j++)
      a[i] = b[j];
  #pragma xlang interchange loop1 loop2
  #pragma xlang fullunroll loop1
Generated code:
  #pragma xlang name loop2
  for (j = min2; j < max2; j++) {
    a[0] = b[j]; a[1] = b[j]; a[2] = b[j]; a[3] = b[j];
  }
15
Analyses and transformations
- Static analyses should also enable the design of smarter (higher-level) transformation primitives
- An external tool is used to gather this information
16
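Example with analysis
Original loop:
  for (i = 2; i < 2*N; i += 2) {
    u[i]   = u[i-1] + u[i-2];
    u[i+1] = u[i]   + u[i-1];
  }
Without interference graph (both inputs reloaded from memory in every iteration):
  for (i = 2; i < 2*N; i += 2) {
    u_1 = u[i-1]; u_2 = u[i-2];
    u_0 = u_1 + u_2;
    u_1 = u_0 + u_1;
    u[i] = u_0; u[i+1] = u_1;
  }
With interference graph (u[i] and u[i+1] computed in one iteration are the u[i-2] and u[i-1] of the next, so they stay in u_0 and u_1 and the loads leave the loop):
  u_0 = u[0]; u_1 = u[1];
  for (i = 2; i < 2*N; i += 2) {
    u_0 = u_1 + u_0;
    u_1 = u_0 + u_1;
    u[i] = u_0; u[i+1] = u_1;
  }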
17
Extending the X-Language
New transformations are defined by rewriting rules: the pattern before the transformation and the pattern after it, each terminated by %. Strip-mining by 4 with a remainder loop:
  #pragma xlang name iloop
  for (i = 0; i < N; i++) { }
  %
  #pragma xlang name iiloop1
  for (ii = 0; ii < (N/4)*4; ii += 4)
    #pragma xlang name iloop1
    for (i = ii; i < ii+4; i++) { }
  #pragma xlang name iloop2
  for (i = (N/4)*4; i < N; i++) { }
  %
18
Daxpy Example
  #pragma xlang name loop1
  for (k = 0; k < 2000; k++)
    Y[k] = alpha*X[k] + Y[k];
We can modify the value of N:
  /* A few values of the unrolling factor N are tested; each yields a different generated version */
  #pragma xlang transform stripmine loop1 k N;
  #pragma xlang transform scalarize-in X in loop1
  #pragma xlang transform lift l1.loads before loop1
  #pragma xlang transform scalarize-out Y in loop1
  #pragma xlang transform lift loop1.loads before loop1
  #pragma xlang transform lift loop1.stores after loop1
  #pragma xlang transform fullunroll loop1.loads
  #pragma xlang transform fullunroll loop1.stores
  #pragma xlang transform fullunroll loop1
19
Daxpy Example: Different Generated Versions
Unrolling factor 2:
  for (k = 0; k < 2000; k = k+2) {
    double x_0 = X[k+0]; double x_1 = X[k+1];
    double y_0 = Y[k+0]; double y_1 = Y[k+1];
    y_0 = alpha*x_0 + y_0; y_1 = alpha*x_1 + y_1;
    Y[k+0] = y_0; Y[k+1] = y_1;
  }
Unrolling factor 4:
  for (k = 0; k < 2000; k = k+4) {
    double x_0 = X[k+0]; double x_1 = X[k+1]; double x_2 = X[k+2]; double x_3 = X[k+3];
    double y_0 = Y[k+0]; double y_1 = Y[k+1]; double y_2 = Y[k+2]; double y_3 = Y[k+3];
    y_0 = alpha*x_0 + y_0; y_1 = alpha*x_1 + y_1; y_2 = alpha*x_2 + y_2; y_3 = alpha*x_3 + y_3;
    Y[k+0] = y_0; Y[k+1] = y_1; Y[k+2] = y_2; Y[k+3] = y_3;
  }
Unrolling factor 8:
  for (k = 0; k < 2000; k = k+8) {
    double x_0 = X[k+0]; double x_1 = X[k+1]; double x_2 = X[k+2]; …
    y_0 = alpha*x_0 + y_0; y_1 = alpha*x_1 + y_1; y_2 = alpha*x_2 + y_2; y_3 = alpha*x_3 + y_3; …
    Y[k+0] = y_0; Y[k+1] = y_1; Y[k+2] = y_2; Y[k+3] = y_3; …
  }
20
Matrix Multiply (Loop Declaration)
The DGEMM example: matrix multiplication
  #pragma xlang name iloop
  for (i = 0; i < NB; i++)
    #pragma xlang name jloop
    for (j = 0; j < NB; j++)
      #pragma xlang name kloop
      for (k = 0; k < NB; k++) {
        c[i][j] = c[i][j] + a[i][k]*b[k][j];
      }
Problems: data locality, scheduling
21
Matrix Multiply (Transformation Declaration)
Sequence of transformations for Itanium:
  #pragma xlang transform stripmine iloop NU NUloop
  #pragma xlang transform stripmine jloop MU MUloop
  #pragma xlang transform interchange kloop MUloop
  #pragma xlang transform interchange jloop NUloop
  #pragma xlang transform interchange kloop NUloop
  #pragma xlang transform fullunroll NUloop
  #pragma xlang transform fullunroll MUloop
  #pragma xlang transform scalarize_in b in kloop
  #pragma xlang transform scalarize_in a in kloop
  #pragma xlang transform scalarize_in&out c in kloop
  #pragma xlang transform lift kloop.loads before kloop
  #pragma xlang transform lift kloop.stores after kloop
22
Matrix Multiply (Transformation Sequence)
  #pragma xlang name iloop
  for (i = 0; i < NB; i++) {
    #pragma xlang name jloop
    for (j = 0; j < NB; j += 4) {
      #pragma xlang name kloop.loads
      { c_0_0 = c[i+0][j+0]; c_0_1 = c[i+0][j+1]; c_0_2 = c[i+0][j+2]; c_0_3 = c[i+0][j+3]; }
      #pragma xlang name kloop
      for (k = 0; k < NB; k++) {
        { a_0 = a[i+0][k]; a_1 = a[i+0][k]; a_2 = a[i+0][k]; a_3 = a[i+0][k]; }
        { b_0 = b[k][j+0]; b_1 = b[k][j+1]; b_2 = b[k][j+2]; b_3 = b[k][j+3]; }
        { c_0_0 = c_0_0 + a_0*b_0; c_0_1 = c_0_1 + a_1*b_1; c_0_2 = c_0_2 + a_2*b_2; c_0_3 = c_0_3 + a_3*b_3; }
        ...
      }
      #pragma xlang name kloop.stores
      { c[i+0][j+0] = c_0_0; c[i+0][j+1] = c_0_1; c[i+0][j+2] = c_0_2; c[i+0][j+3] = c_0_3; }
    }
  }
  ... // Remainder code
23
Block copies
- Blocked matrix multiplication performs better when the blocks are contiguous in memory (fewer TLB misses)
- A copy written in C performs poorly
- We therefore resort to a tool that generates specific assembly code
- The tool finds good code through search (XLG is an assembly-level search tool)
24
Matrix Multiply (Results)
25
Conclusion
- Transformations are described with reuse, procedures, and conditionals
- X-Language: a language designed to generate multi-version programs
- A multistage language with a flexible pattern-matching and rewriting language
- Experts can describe application-specific transformations and optimizations
26
Future work
- Dependence analysis
- Further search over assembly-code transformations
- More transformations: vectorization, alignment, …