
Compilers and Applications Kathy Yelick Dave Judd, Ronny Krashinsky, Randi Thomas, Samson Kwok, Simon Yau, Kar Ming Tang, Adam Janin, Thinh Nguyen Computer.


1 Compilers and Applications
Kathy Yelick
Dave Judd, Ronny Krashinsky, Randi Thomas, Samson Kwok, Simon Yau, Kar Ming Tang, Adam Janin, Thinh Nguyen
Computer Science Division, UC Berkeley

2 Compiling for VIRAM
Long-term success of DIS technology depends on a simple programming model, i.e., a compiler
Needs to handle a significant class of applications
–IRAM: multimedia, graphics, speech and image processing
–ISTORE: databases, signal processing, other DIS benchmarks
Needs to utilize hardware features for performance
–IRAM: vectorization
–ISTORE: scalability of shared-nothing programming model

3 IRAM Compilers
IRAM/Cray vectorizing compiler [Judd]
–Production compiler
    Used on the T90, C90, as well as the T3D and T3E
    Being ported (by SGI/Cray) to the SV2 architecture
–Has C, C++, and Fortran front-ends (focus on C)
–Extensive vectorization capability
    outer loop vectorization, scatter/gather, short loops, …
–VIRAM port is under way
IRAM/VSUIF vectorizing compiler [Krashinsky]
–Based on VSUIF from Corinna Lee’s group at Toronto
    which is based on MachineSUIF from Mike Smith’s group at Harvard
    which is based on SUIF compiler from Monica Lam’s group at Stanford
–This is a “research” compiler, not intended for compiling large complex applications
–It has been working since 5/99.

4 IRAM/Cray Compiler Status
MIPS backend developed this year
–Validated using a commercial test suite for code generation
–Generated code run through vas assembler
Vector backend recently started
–Testing with vsim under way this week
Leveraging from Cray
–Automatic vectorization
–Basic instruction scheduling framework
[Diagram labels: Frontends (C, Fortran, C++); Vectorizer; PDGCS; Code Generators (IRAM, C90)]

5 ISTORE Compiler
Titanium language is an extension of Java
–tc is the Titanium compiler
Recent progress:
–improved portability of generated code and the compiler itself, including port to Cray parallel machines
–additions to generate annotations on C code to improve fine-grained parallelism (on Tera MTA) and vectorization
New benchmarking efforts
–database primitives: sorting, hash-join and index-nested-loop join
–3d FFT and linear solvers (LU)
[Diagram labels: Java, Titanium, tc, Optimizer, Code Gen, C + comm, C compiler, cc, ISTORE, t3e]

6 Applications
Hand-written kernels for single-chip VIRAM
–focus on multimedia kernels, see IRAM hardware talk
Compiled programs for single-chip VIRAM
–2 examples from IRAM/VSUIF: decryption and mvm
–most effort devoted to IRAM/Cray compiler
Performance benchmarks for ISTORE
–3d FFT
–Others
SAM benchmarks for ISTORE

7 Automatic Vectorization
Vectorizing compilers very successful on scientific applications
–not entirely automatic, especially for C/C++
–good tools for training users
Multimedia applications have
–shorter vector lengths
–can sometimes exploit outer loop vectorization for longer vectors
–often leads to non-unit strides
–tree traversals could be written as scatter/gather (breadth-first), although automating this is far from solved
    e.g., image compression
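The trade-off on slide 7 between short vectors and non-unit strides can be illustrated with a small sketch. The kernel (row sums of an n-by-k row-major matrix with small k) and the function names are assumptions chosen for illustration; they do not come from the talk.

```c
#include <stddef.h>

/* Inner-loop form: the loop a vectorizer would pick runs over j,
 * which is short when k is small (hypothetical example kernel). */
void row_sums_inner(size_t n, size_t k, const float *a, float *sum)
{
    for (size_t i = 0; i < n; i++) {
        float s = 0.0f;
        for (size_t j = 0; j < k; j++)
            s += a[i * k + j];      /* unit stride, but only k elements */
        sum[i] = s;
    }
}

/* Outer-loop form: the loops are interchanged so the innermost loop
 * runs over i. Vectors now have n elements, but consecutive elements
 * of a are k floats apart (non-unit stride). */
void row_sums_outer(size_t n, size_t k, const float *a, float *sum)
{
    for (size_t i = 0; i < n; i++)
        sum[i] = 0.0f;
    for (size_t j = 0; j < k; j++)
        for (size_t i = 0; i < n; i++)
            sum[i] += a[i * k + j]; /* n elements per vector, stride k */
}
```

Both functions compute the same sums. The second form also removes the reduction from the innermost loop, at the price of strided loads.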

8 IRAM/VSUIF Decryption (IDEA)
IDEA Decryption operates on 16-bit ints
Compiled with IRAM/VSUIF (with unrolling by hand)
Note scalability of both #lanes and data width
[Chart; axis label: # lanes]
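For readers unfamiliar with IDEA, the 16-bit operation that dominates the cipher is multiplication modulo 2^16 + 1, where the value 0 stands for 2^16. The sketch below is a plain scalar C version of that standard operation, not the code compiled in the talk; the array wrapper `idea_mul_array` is a hypothetical helper that only shows the kind of loop a vectorizer would target.

```c
#include <stddef.h>
#include <stdint.h>

/* IDEA multiplication modulo 65537, with 0 representing 65536. */
uint16_t idea_mul(uint16_t a, uint16_t b)
{
    if (a == 0)
        return (uint16_t)(1u - b);  /* 65536 * b == -b (mod 65537) */
    if (b == 0)
        return (uint16_t)(1u - a);
    uint32_t p = (uint32_t)a * b;
    uint16_t lo = (uint16_t)(p & 0xFFFFu);
    uint16_t hi = (uint16_t)(p >> 16);
    /* p = hi * 2^16 + lo == lo - hi (mod 65537); add 1 on borrow. */
    return (uint16_t)(lo - hi + (lo < hi));
}

/* Hypothetical wrapper: apply one subkey k to n independent 16-bit
 * words. Every iteration is independent, so the loop is a candidate
 * for vectorization (the branches would become masked operations). */
void idea_mul_array(size_t n, const uint16_t *x, uint16_t k, uint16_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = idea_mul(x[i], k);
}
```

Because every operand is 16 bits wide, a vector unit that subdivides its datapath can process more elements per cycle than it could with 32-bit data, which is one plausible reading of the slide's note on data width.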

9 VIRAM/VSUIF Matrix/Vector Multiply
VIRAM/VSUIF does reasonably well on long loops
[Chart series: mvm, vmm]
256x256 single matrix
Compare to 1600 Mflop/s (peak without multadd)
Note BLAS-2 (little reuse)
~350 on Power3 and EV6
Problems specific to VSUIF
–hand strip-mining results in short loops
–reductions
–no multadd support
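The two issues named on slide 9, reductions and strip-mining, can be seen in a pair of small kernels. This is a sketch under assumptions: it takes "mvm" and "vmm" to mean matrix-vector and vector-matrix products on a row-major matrix, and `MVL` is a hypothetical maximum vector length, not a documented VIRAM parameter.

```c
#include <stddef.h>

#define MVL 64  /* hypothetical maximum vector length (assumption) */

/* mvm: y = A * x for an n-by-m row-major matrix. The inner loop is a
 * dot product, so vectorizing it requires a reduction. */
void mvm_dot(size_t n, size_t m, const float *a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {
        float s = 0.0f;
        for (size_t j = 0; j < m; j++)
            s += a[i * m + j] * x[j];
        y[i] = s;
    }
}

/* vmm: y = x^T * A, strip-mined by hand into chunks of at most MVL
 * columns. The inner loop is a unit-stride multiply-and-add with no
 * reduction; each strip of y is accumulated across all rows. */
void vmm_stripmined(size_t n, size_t m, const float *a, const float *x, float *y)
{
    for (size_t j0 = 0; j0 < m; j0 += MVL) {
        size_t vl = (m - j0 < MVL) ? (m - j0) : MVL;
        for (size_t j = 0; j < vl; j++)
            y[j0 + j] = 0.0f;
        for (size_t i = 0; i < n; i++) {
            float xi = x[i];
            for (size_t j = 0; j < vl; j++)
                y[j0 + j] += xi * a[i * m + j0 + j];
        }
    }
}
```

If the strip length is chosen smaller than the hardware vector length, the innermost loop becomes short, which matches the "short loops" problem listed on the slide.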

10 3D FFT on ISTORE
Performance of large 3D FFTs depends on 2 factors
–speed of 1D FFT on a single node (next slide)
–network bandwidth for “transposing” data
–1.3 Tflop FFT possible w/ 1K IRAM nodes and 0.5 TB/s bw
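The figures on slide 10 can be turned into rough per-node numbers. The sketch below makes several assumptions that the slide does not state: "1K" means 1024 nodes, the quoted rate and bandwidth are aggregates, and the flop count uses the conventional 5 N log2 N estimate for a complex FFT of N points. The function names are hypothetical.

```c
/* Divide an aggregate rate evenly across nodes. */
double per_node(double aggregate, unsigned nodes)
{
    return aggregate / (double)nodes;
}

/* Conventional flop estimate for a complex 3D FFT on a cube with
 * 2^log2n points per side: N = 2^(3*log2n) points, 5 * N * log2(N). */
double fft3d_flops(unsigned log2n)
{
    double points = (double)(1ULL << (3u * log2n));
    return 5.0 * points * (3.0 * (double)log2n);
}

/* Time in seconds for the compute part alone at a given flop rate. */
double fft3d_seconds(unsigned log2n, double flops_per_second)
{
    return fft3d_flops(log2n) / flops_per_second;
}
```

Under these assumptions, `per_node(1.3e12, 1024)` gives about 1.27 Gflop/s per node, below the 1600 Mflop/s single-chip peak quoted on slide 9. The estimate ignores the transpose, which the slide identifies as the second limiting factor.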

11 1D FFT on IRAM
TigerSHARC DSP (Analog Devices): 41 us (32 bit)
IRAM: 37 us (32 bit)
TMS320C6000 DSP (Texas Instruments): 124 us (32 bit)
DSP56002 DSP (Motorola): 908 us (24 bit)
FFT study on IRAM [Randi Thomas]
–hand-coded and scheduled
–use of ISA features to make in-register FFTs fast (128 point)
–bit-reversal time not included; will also use ISA support
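The bit-reversal step that slide 11 excludes from the timings is the reordering a radix-2 FFT needs before or after its butterfly stages. The sketch below is a generic scalar version in C, shown only to make the step concrete; it is not the hand-coded IRAM routine and does not use any ISA support. The function names are hypothetical.

```c
/* Reverse the low nbits bits of x. */
unsigned reverse_bits(unsigned x, unsigned nbits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < nbits; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* In-place bit-reversal permutation of a complex array of 2^log2n
 * points, stored as separate real and imaginary arrays. */
void bit_reverse_permute(float *re, float *im, unsigned log2n)
{
    unsigned n = 1u << log2n;
    for (unsigned i = 0; i < n; i++) {
        unsigned j = reverse_bits(i, log2n);
        if (j > i) {                /* swap each pair exactly once */
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
}
```

For the 128-point case mentioned on the slide, `log2n` would be 7. The permuted accesses are irregular, which is why a vector version would rely on indexed loads and stores or dedicated instructions.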

12 Other ISTORE Applications
Working on several performance applications for ISTORE
–Database primitives: sorts, joins, scans, etc. [Kar Ming Tang]
–RT_STAP
    QR Decomposition vectorizes easily, partially complete in IRAM/VSUIF
–Conjugate Gradient [Samson Kwok]
    Dominated by sparse matrix-vector multiply
    Current performance: 500/250 Mflops (single/double) on VIRAM
    Compare to 10s of Mflops on most RISC machines
–Dense linear algebra [Simon Yau]
–Considering other DIS benchmarks, such as MoM
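The sparse matrix-vector multiply that dominates Conjugate Gradient on slide 12 can be sketched as follows. The compressed sparse row (CSR) layout is an assumption made for illustration; the slides do not say which storage format the VIRAM code uses, and the function name is hypothetical.

```c
#include <stddef.h>

/* y = A * x for a sparse matrix in CSR form (assumed layout):
 *   row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i,
 *   col_idx[p] is the column of nonzero p, val[p] is its value. */
void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double s = 0.0;
        for (size_t p = row_ptr[i]; p < row_ptr[i + 1]; p++)
            s += val[p] * x[col_idx[p]];  /* indexed load: a gather */
        y[i] = s;
    }
}
```

The access `x[col_idx[p]]` is the gather operation, which cache-based scalar machines handle poorly and vector hardware supports directly. That is one plausible reason for the gap between VIRAM and RISC machines reported on the slide.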

13 Conclusions
Significant compiler progress:
–Cray collaboration key [Dave Judd UCB @ Eagan]
–Good tech transfer model
–Vector code gen and instruction scheduling next steps
Even VSUIF version indicates reasonable performance
–Commercial-quality compiler will allow non-toy applications, e.g., Speech
Benchmarks
–Have been used to help with final ISA design
–Simulated results validate performance claims
–Models show real advantage to Intelligence in Memory (and Disk)
–Machines scale, and with a simpler programming and optimization model than conventional multiprocessors

