1
Compiling for VIRAM
Kathy Yelick, Dave Judd, and Ronny Krashinsky
Computer Science Division, UC Berkeley
2
Compiling for VIRAM
Long-term success of VIRAM depends on a simple programming model, i.e., a compiler
Needs to handle a significant class of commercial applications
–primary focus on multimedia, graphics, speech, and image processing
Needs to exploit hardware features for performance
3
Questions
Is there sufficient vectorization potential in commercial applications?
Can automatic vectorization find it?
How difficult is code generation for the VIRAM ISA?
How hard is optimization for the VIRAM-1 implementation?
4
IRAM Compilers
IRAM/Cray vectorizing compiler [Judd]
–Production compiler: used on the T90 and C90, as well as the T3D and T3E; being ported (by SGI/Cray) to the SV2 architecture
–Has C, C++, and Fortran front-ends (focus on C)
–Extensive vectorization capability: outer loop vectorization, scatter/gather, short loops, …
–VIRAM port is under way
IRAM/VSUIF vectorizing compiler [Krashinsky]
–Based on VSUIF from Corinna Lee's group at Toronto, which is based on MachineSUIF from Mike Smith's group at Harvard, which is based on the SUIF compiler from Monica Lam's group at Stanford
–A "research" compiler, not intended for compiling large, complex applications
–It exists now!
5
IRAM/Cray Compiler Status
MIPS backend developed since the last retreat
–Compiles ~90% of a commercial test suite for code generation
–Generated code is run through vas
–Remaining issues with register spilling and the loader
Vector instructions are being added
Question: what information is "known" in each stage?
[Diagram: C, Fortran, and C++ front-ends -> vectorizer (PDGCS) -> code generators for IRAM and C90.]
6
Vectorizing Mediaprocessing (Dubey)

Kernel                          Vector length
Matrix transpose/multiply       # vertices at once
DCT (video, comm.)              image width
FFT (audio)                     256-1024
Motion estimation (video)       image width, i.w./16
Gamma correction (video)        image width
Haar transform (media mining)   image width
Median filter (image process.)  image width
Separable convolution (ditto)   image width

(from http://www.research.ibm.com/people/p/pradeep/tutor.html)
7
Vectorization in SPECint95

Kernel                  Vector length
m88ksim                 5, 16, 32
comp                    512
decomp                  512
lisp interpreter (gc)   80, 1000
ijpeg                   8, 16, width/2, width

(from K. Asanovic, UCB PhD thesis, 1998)
8
Automatic Vectorization
Vectorizing compilers are very successful on scientific applications
–not entirely automatic, especially for C/C++
–good tools exist for training users
Multimedia applications
–have shorter vector lengths
–can sometimes exploit outer loop vectorization for longer vectors, which often leads to non-unit strides
–tree traversals could be written as scatter/gather (breadth-first), although automating this is far from solved, e.g., for image compression
9
How Hard is Codegen for VIRAM?
Variable virtual processor width (VPW)
Variable maximum vector length (MVL)
Fixed-point arithmetic (saturating add, etc.)
Memory operations
10
Vector Architectural State
[Diagram: vector registers vr0–vr31, each holding vl elements of width vpw; corresponding elements across registers form virtual processors VP 0 through VP vl-1.]
Number of VPs is given by the vector length register vl
Width of each VP is given by the register vpw
–vpw is one of {8b, 16b, 32b, 64b}
Maximum vector length is given by a read-only register mvl
–mvl depends on the implementation and on vpw: {n/a, 128, 64, 32} in VIRAM-1
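To make the virtual-processor view concrete, here is a minimal C model (an illustration, not from the slides): at vpw = 16b, one vector add behaves like vl independent 16-bit adds, one per VP. Note how mvl shrinks as vpw grows in the table above.

    #include <stdint.h>

    enum { MVL16 = 128 };   /* mvl when vpw = 16b in VIRAM-1 (from the table above) */

    /* One vector register at vpw = 16b: mvl elements of 16 bits each. */
    typedef int16_t vreg16[MVL16];

    /* A vector add is vl independent element-wise adds, one per virtual
       processor; elements past vl are untouched. */
    static void vadd16(vreg16 dst, const vreg16 a, const vreg16 b, int vl) {
        for (int vp = 0; vp < vl; vp++)
            dst[vp] = (int16_t)(a[vp] + b[vp]);  /* wraps; saturation is a
                                                    separate fixed-point op */
    }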
11
Generating Code for Variable VPW
Strategy used in IRAM/VSUIF: use a compiler flag

for (i = 0; i < N; i++) a[i] = b[i]*b[i];
for (i = 0; i < N; i++) c[i] = c[i]*2;

Compile with the widest data type in the vector part of the program
–a, b, c all arrays of doubles => vpw64
–any of a, b, c an array of doubles => vpw64
–a, b, c all arrays of single-precision floats => vpw64 ok, vpw32 faster
–similarly for ints
Limitation: a single file cannot contain loops that should use different vpw's
–Reason: simplicity
12
Example: Decryption (IDEA)
IDEA decryption operates on 16-bit ints
Compiled with IRAM/VSUIF (with unrolling by hand)
[Chart: performance vs. number of lanes.]
13
Generating Code for Variable VPW
Planned strategy for IRAM/Cray: the backend determines the minimum correct vpw for each loop nest

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[j][i] = ...;
    b[i][j] = ...;
  }
for (i = 0; i < N; i++) c[i] = c[i]*2;

Code generation does a pass over each loop nest.
Limitation: a single loop nest runs at the speed of its widest type
–Reason: simplicity & performance of the common case
Assumes the vectorizer does not need to know vpw.
14
Generating Code for Variable MVL
Maximum vector length is not specified in the IRAM ISA.
In the Cray compiler, mvl is known at compile time.
–Different customer/software model (recompiling is not onerous)
Three potential problems with unknown mvl
–register spilling: not done in IRAM/VSUIF
–short loop vectorization: not treated specially in IRAM/VSUIF
–length-dependent vectorization (see the sketch below):

for (i = 0; i < n; i++) a[i] = a[i+32];
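To see why the last loop is length-dependent: a strip of vl iterations writes a[i..i+vl-1] while reading a[i+32..i+vl+31], and the two ranges are disjoint only when vl <= 32. Below is a hedged sketch of the run-time guard a compiler could emit when mvl is unknown until run time (read_mvl() is an assumed placeholder, not a real VIRAM intrinsic):

    /* read_mvl() is an assumed placeholder for reading the mvl register. */
    extern int read_mvl(void);

    /* a must have at least n + 32 elements, as in the original loop. */
    void shift_down_32(int *a, int n) {
        int mvl = read_mvl();
        if (mvl <= 32) {
            /* Safe: each strip's loads and stores touch disjoint elements. */
            for (int i = 0; i < n; i += mvl) {
                int vl = (n - i < mvl) ? (n - i) : mvl;
                for (int ii = i; ii < i + vl; ii++)  /* one vector load + store */
                    a[ii] = a[ii + 32];
            }
        } else {
            for (int i = 0; i < n; i++)  /* scalar fallback preserves order */
                a[i] = a[i + 32];
        }
    }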
15
Register Spilling and MVL
Register spilling
–normally reserves fixed space in the stack frame
–with unknown MVL, the compiler can't determine the stack frame layout
Possible solutions
–make MVL's values visible in the architectural specification
–use dynamic allocation for spill slots
–hybrid (see the sketch below): use stack allocation as long as MVL is less than or equal to VIRAM-1's size, namely 32 64-bit values; store a "dope vector" in place otherwise; this costs an extra branch when spilling, which is much better than dynamic allocation in the "common" case
–proposal: work out the details of the hybrid, but only implement the VIRAM-1 case
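A minimal sketch of the hybrid spill scheme, assuming VIRAM-1's 32 64-bit elements per slot; the names and the dope-vector layout here are illustrative, not the compiler's actual representation:

    #include <stdint.h>
    #include <stdlib.h>

    enum { VIRAM1_MVL64 = 32 };   /* fixed stack slot: 32 64-bit elements */

    /* Spill one vector register; the matching reload (and free) is symmetric
       and not shown. */
    void spill_vreg(uint64_t slot[VIRAM1_MVL64], const uint64_t *vreg, int mvl) {
        if (mvl <= VIRAM1_MVL64) {          /* common case: fits the stack slot */
            for (int i = 0; i < mvl; i++)
                slot[i] = vreg[i];
        } else {                            /* rare case: dope vector in the slot */
            uint64_t *buf = malloc((size_t)mvl * sizeof *buf);
            for (int i = 0; i < mvl; i++)
                buf[i] = vreg[i];
            slot[0] = (uint64_t)(uintptr_t)buf;  /* pointer stored as dope vector */
        }
    }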
16
Short Vectors and MVL
Strip-mining turns an arbitrary-length loop into a loop over vector operations; the inner loop becomes a single vector instruction:

for (i = 0; i < n; i += mvl)
  for (ii = i; ii < min(i + mvl, n); ii++)
    c[ii] = a[ii] + b[ii];

Not a good idea if n < mvl
–this was problematic in evaluating IRAM/VSUIF, since strip-mining by hand (in the source) produces poor-quality code
–short vectors are common in multimedia applications
Needed by the vectorizer (which does not know about vpw)
–the Cray compiler treats similar cases by generating multiple versions; that doesn't match our small-memory view
Plan: avoid dependence on MVL in the vectorizer
Choosing "good" transformations is the hard problem.
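For reference, the strip-mined form written out as plain, compilable C (a sketch of compiler output; the min test is explicit because C has none). When n < mvl the outer loop runs exactly once, with a short vector of length n:

    /* Strip-mined vector add in plain C; a sketch of what the compiler emits. */
    void vadd(double *c, const double *a, const double *b, int n, int mvl) {
        for (int i = 0; i < n; i += mvl) {
            int vl = (n - i < mvl) ? (n - i) : mvl;  /* vector length this strip */
            for (int ii = i; ii < i + vl; ii++)      /* one vector instruction */
                c[ii] = a[ii] + b[ii];
        }
    }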
17
Using VIRAM Arithmetic
"Peak" performance depends on use of fused multiply-add, unlike on Crays. Other compilers find these.
Fixed-point operations are not supported in HLLs
–No plans for compiler support in VIRAM/Cray
–although vectorizing over short data widths is possible in IRAM/VSUIF; likely in IRAM/Cray as well
–Under what circumstances can one determine vpw/fixed-point optimizations? (see the sketch below)
  IDCT does an immediate round/pack after operations
  Motion estimation does a min after diff/abs
–Easier than discovering fixed point from floating point
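Since C has no fixed-point types, the compiler would have to recognize source idioms like the following saturating 16-bit add (a generic illustration of the clamp-after-add pattern, not VIRAM code) and map them onto the hardware's saturating instructions:

    #include <stdint.h>

    /* Saturating 16-bit add in portable C: the pattern a compiler would
       need to match to use a hardware saturating add. */
    static int16_t sat_add16(int16_t x, int16_t y) {
        int32_t sum = (int32_t)x + y;           /* compute in a wider type */
        if (sum > INT16_MAX) sum = INT16_MAX;   /* clamp on overflow */
        if (sum < INT16_MIN) sum = INT16_MIN;
        return (int16_t)sum;
    }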
18
VIRAM Memory System
Multiple base/stride registers
–IRAM/VSUIF only uses one, but using more is not difficult
–Auto-increment seems possible in principle and may be important for short-vector programs
Memory barriers/synchronization
–Experience on Cray vector machines: serious performance impact
  Only a single kind of (very expensive) barrier
  Some instructions had barrier-like side effects
–Ordered scatters: for sparse matrices? Pointer-chasing? (see the sketch below)
–Answers may depend on particular compiler assumptions, rather than fundamental problems
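The ordered-scatter question arises in loops like this sparse accumulation (a generic example, not from the slides): if ind[] contains duplicate indices, the scattered stores must complete in loop order for the sums to come out right, so an unordered scatter cannot be used.

    /* Sparse accumulate: y[ind[i]] may hit the same element more than once,
       so a vector scatter must preserve loop order (or the loop stays scalar). */
    void sparse_accum(double *y, const double *x, const int *ind, int n) {
        for (int i = 0; i < n; i++)
            y[ind[i]] += x[i];
    }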
19
VIRAM/VSUIF Matrix/Vector Multiply
VIRAM/VSUIF does reasonably well on long loops
[Chart: performance of mvm, vmm, and mm kernels on a 256x256 single-precision matrix.]
Compare to 1600 Mflop/s (peak without multiply-add)
Problems with (see the sketch below)
–short loops
–reductions
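The reduction problem shows up directly in matrix-vector multiply. In the dot-product form below (a generic illustration), the inner loop accumulates into a scalar, which vectorizes poorly; interchanging the loops yields a column-sweep form whose inner loop is a long vector update with no reduction:

    /* Dot-product form: the inner loop is a reduction into s. */
    void mvm_dot(int n, double A[n][n], const double x[n], double y[n]) {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += A[i][j] * x[j];
            y[i] = s;
        }
    }

    /* Column-sweep form after interchange: the inner loop is a long vector
       update of y with no reduction (A is walked down a column, which is a
       non-unit stride in row-major C). */
    void mvm_axpy(int n, double A[n][n], const double x[n], double y[n]) {
        for (int i = 0; i < n; i++) y[i] = 0.0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                y[i] += A[i][j] * x[j];
    }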
20
Conclusions
Compiler support for short vectors and non-unit stride is key.
Compilers can perform most of the transformations people are doing by hand on multimedia kernels
–loop unrolling, interchange, outer loop vectorization, instruction scheduling
–exceptions are fixed point and changing algorithms, e.g., DCT
The major challenge is choosing the right transformations
–avoid pessimizing transformations
–avoid code bloat
Good news: the performance of vector machines is much more easily captured by a performance model in the compiler than that of out-of-order superscalars with multi-level caches
21
Backup slides: Applications and Compilers
22
How Hard is Optimization for VIRAM-1?
23
Example: Decryption (IDEA)
IDEA decryption operates on 16-bit ints
Compiled with IRAM/VSUIF (with unrolling by hand)
[Chart: performance in GOP/s vs. number of lanes and data width.]
24
Operation & Instruction Count: RISC v. "VSIW" Processor
(from F. Quintana, U. Barcelona)

Spec92fp      Operations (M)       Instructions (M)
Program      RISC  VSIW  R/V      RISC  VSIW   R/V
swim256       115    95  1.1x      115   0.8  142x
hydro2d        58    40  1.4x       58   0.8   71x
nasa7          69    41  1.7x       69   2.2   31x
su2cor         51    35  1.4x       51   1.8   29x
tomcatv        15    10  1.4x       15   1.3   11x
wave5          27    25  1.1x       27   7.2    4x
mdljdp2        32    52  0.6x       32  15.8    2x

VSIW reduces operations by 1.2X and instructions by 20X!
25
Revive Vector (= VSIW) Architecture!
–Cost: $1M each? -> Single-chip CMOS MPU/IRAM
–Low-latency, high-BW memory system? -> IRAM = low latency, high bandwidth memory
–Code density? -> Much smaller than VLIW/EPIC
–Compilers? -> For sale, mature (>20 years)
–Vector performance? -> Easy to scale speed with technology
–Power/Energy? -> Parallel to save energy, keep performance
–Scalar performance? -> Include a modern, modest CPU: OK scalar (MIPS 5K v. 10K)
–Real-time? -> No caches, no speculation: repeatable speed as input varies
–Limited to scientific applications? -> Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
26
Compiler Status: S98 Review
Plans to use SGI's compiler had fallen through
Short-term effort identified: VIC
–based on small extensions to C
–mainly to ease the benchmarking effort and get more people involved
–end-of-summer deadline for completion
Intermediate term: SUIF
–based on VSUIF from Toronto
–retarget the backend
Other possibilities: Java compiler (Titanium)
27
Vector Surprise
Use vectors for inner-loop parallelism (no surprise)
–One dimension of the array: A[0,0], A[0,1], A[0,2], ...
–think of the machine as 32 vector registers, each with 64 elements
–1 instruction updates 64 elements of 1 vector register
... and for outer-loop parallelism!
–1 element from each column: A[0,0], A[1,0], A[2,0], ...
–think of the machine as 64 "virtual processors" (VPs), each with 32 scalar registers! (like a multithreaded processor)
–1 instruction updates 1 scalar register in 64 VPs
The hardware is identical; these are just 2 compiler perspectives (see the sketch below)
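As a concrete illustration (generic code, not from the slides), the same element-wise update can be vectorized either way; only the choice of which loop becomes the vector instruction differs:

    /* Inner-loop vectorization: one vector instruction sweeps a row,
       A[i][0..63], at unit stride. */
    void add_rows(double A[64][64], double B[64][64]) {
        for (int i = 0; i < 64; i++)
            for (int j = 0; j < 64; j++)   /* this loop becomes the vector op */
                A[i][j] += B[i][j];
    }

    /* Outer-loop vectorization: one vector instruction sweeps a column,
       A[0..63][j]: one element per "virtual processor", stride = row length. */
    void add_cols(double A[64][64], double B[64][64]) {
        for (int j = 0; j < 64; j++)
            for (int i = 0; i < 64; i++)   /* this loop becomes the vector op */
                A[i][j] += B[i][j];
    }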
28
Software Technology Trends Affecting V-IRAM?
V-IRAM: any CPU + vector coprocessor/memory
–scalar/vector interactions are limited and simple
–Example: V-IRAM architecture based on ARM 9 or MIPS
Vectorizing compilers have been built for 25 years
–one can be bought for a new machine from The Portland Group
Microsoft "Win CE" / Java OS for non-x86 platforms
Library solutions (e.g., MMX); retarget packages
The software distribution model is evolving?
–New model: Java byte codes over the network, plus a Just-In-Time compiler to tailor the program to the machine?
29
Short-term compiler objectives
An intermediate solution to demonstrate IRAM features and performance while the other compiler effort is ongoing
–Generate efficient vector code, particularly for key loops
–Allow use of IRAM-specific ISA features: variable VP width; flag registers (masked operations)
Not a fully automatic vectorizing compiler
–But higher level than assembly language
Programs should be easily convertible to a format suitable as input to a standard vectorizing compiler or scalar compiler
Up and running quickly
–Minimal compiler modifications necessary
–Easy to debug
30
Current compiler infrastructure
vic: source-to-source translator
–input: C with vector extensions
–output: C with embedded V-IRAM assembly code
vas: V-IRAM assembler
Two paths for vic code: debugging or simulation
–simulation: .vic -> vic -> .c + embedded asm (.s) -> gcc + vas -> a.out -> vsim
–debugging: .vic -> vic -> .c -> gcc -> a.out, run natively on Sun, SGI, PC, ...
31
Vic language extensions to C
Based on a register-register machine model
Special C variables represent vector registers
–User is responsible for register allocation
vfor keyword indicates a loop should be vectorized
–User must strip-mine loops explicitly, but the compiler determines the appropriate vector length

#define x vr0
#define y vr1
#define z vr2

vfor (i = 0; i < 32; i++) {
    z[i] = x[i] + y[i];
}

VL, MVL, and VPW are used to access the vector control registers.
32
Language extensions, cont.
Unit-stride, non-unit-stride, and indexed loads/stores

int A[256];
vfor (i = 0; i < 32; i++) {
    x[i] = A[2*i + 15];   /* stride 2, starting with the 15th element */
}

Masked operations using the vif keyword

vfor (i = 0; i < 32; i++) {
    vif (x[i] < y[i]) {
        z[i] = x[i] + y[i];
    }
}

Operations with no obvious mapping from C are expressed using macros

EXTRACT(x, y, 3, 10);
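Putting the extensions together, here is a hedged sketch of an explicitly strip-mined vector add in vic, as the bullets above describe it. The exact spelling of the MVL access and the variable vfor bound is assumed, so treat this as illustrative rather than official vic syntax:

    #define va vr0
    #define vb vr1
    #define vc vr2

    void vadd(int *A, int *B, int *C, int n) {
        int base, i, len;
        for (base = 0; base < n; base += MVL) {   /* user strip-mines by MVL */
            len = (n - base < MVL) ? (n - base) : MVL;
            vfor (i = 0; i < len; i++) {          /* compiler picks VL from the bound */
                va[i] = A[base + i];
                vb[i] = B[base + i];
                vc[i] = va[i] + vb[i];
                C[base + i] = vc[i];
            }
        }
    }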
33
Kernels in Vic
Vector add: C = A + B
Vector-matrix multiply: C[n] = A[m] * B[m][n]
Matrix-vector multiply: C[m] = B[m][n] * A[n]
Matrix transpose
Load/store tests
FFT, two versions:
–based on the standard Cooley-Tukey algorithm
–a VIRAM-specific version that masks memory ops and uses extracts (underway)
Sparse matrix-vector multiply (underway)
QR decomposition (underway)
34
Vic implementation
Written in Perl
–Easy to write and maintain
Relies heavily on gcc code optimization
–Makes generating vic output simpler
Generates code for the old VIRAM ISA
–Some small changes needed
–Doesn't use the new vector ld/st with a single address register
–Doesn't handle the shadow of scalar registers in the vector unit
  fix by adding move-to/from around vector instructions that use scalar registers, or do real register allocation (not in VIC)
35
Intermediate-term compiler: SUIF
Port VSUIF to the V-IRAM ISA
–Vectorizing version of the Stanford SUIF compiler
–C and F77 front-ends
–Currently generates code for vector ISAs from the University of Toronto work: T0, Sun VIS
Toronto plans for VSUIF extensions
–Add outer-loop vectorization
–Support masked operations
A Berkeley undergraduate is currently working on the VSUIF port to the V-IRAM ISA
36
Compiler Status: W99
Plans to use SGI's compiler revived
VIC
–good news: initial VIC implementation done 7/98 (Oppenheimer)
–more good news: undergrad benchmarking group (Thomas)
–bad news: the change of ISA makes VIC obsolete
SUIF
–steep learning curve: vectors, VIRAM, SUIF (Krashinsky)
–starting basic backend modifications
–may need additional people