1 Memory-Aware Compilation Philip Sweany 10/20/2011

2 Architectural Diversity
“Simple” Load/Store
Instruction-level parallel
Heterogeneous multi-core parallelism
“Traditional” parallel architectures
– Vector
– MIMD
Many core
Next???

3 Load/Store Architecture
All arithmetic must take place in registers
Cache hits typically 3-5 cycles
Cache misses more like 100 cycles
Compiler tries to keep scalars in registers
Graph-coloring register assignment
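As a hedged illustration of the last point (my sketch, not the compiler described here), graph-coloring register assignment can be approximated by greedily coloring an interference graph, where an edge means two values are live at the same time and cannot share a register; NVREG, K, and the example graph are assumptions made up for this example.

    #include <stdio.h>

    #define NVREG 4   /* virtual registers in this made-up example */
    #define K     2   /* machine registers available (assumption)  */

    /* interferes[u][v] = 1 when u and v are live at the same time */
    static int interferes[NVREG][NVREG] = {
        {0, 1, 1, 0},
        {1, 0, 1, 0},
        {1, 1, 0, 1},
        {0, 0, 1, 0},
    };

    int main(void) {
        int color[NVREG];
        for (int v = 0; v < NVREG; v++) {
            int used[K];
            for (int c = 0; c < K; c++) used[c] = 0;
            for (int u = 0; u < v; u++)            /* colors taken by earlier neighbors */
                if (interferes[v][u] && color[u] >= 0)
                    used[color[u]] = 1;
            color[v] = -1;
            for (int c = 0; c < K; c++)
                if (!used[c]) { color[v] = c; break; }
            if (color[v] < 0)
                printf("v%d: no register free, spill to memory\n", v);
            else
                printf("v%d -> r%d\n", v, color[v]);
        }
        return 0;
    }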

4 Instruction-Level Parallelism (ILP)
ILP architectures include:
– Multiple pipelined functional units
– Static or dynamic scheduling
Compiler schedules instructions to reduce execution time
– Local scheduling
– Global scheduling
– Software pipelining

5 “Typical” ILP Architecture
8 “generic” pipelined functional units
Timing
– Register operations require 1 cycle
– Memory operations (loads) require 5 cycles (hit) or 50 cycles (miss), pipelined of course
– Stores are buffered, so they don’t require time directly
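The timing model above, written out as the kind of latency table a scheduler might consult (a sketch only; the enum names are mine, not from the slides):

    enum op_class { OP_ALU, OP_LOAD_HIT, OP_LOAD_MISS, OP_STORE };

    /* cycles before a dependent operation can use the result */
    static const int latency[] = {
        [OP_ALU]       = 1,    /* register operations: 1 cycle           */
        [OP_LOAD_HIT]  = 5,    /* load that hits in cache: 5 cycles      */
        [OP_LOAD_MISS] = 50,   /* load that misses: 50 cycles, pipelined */
        [OP_STORE]     = 0,    /* stores are buffered: no direct cost    */
    };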

6 Matrix Multiply
Matrix_multiply a, b, c: int[4][4]
  for i from 0 to 3
    for j from 0 to 3
      c[i][j] = 0
      for k from 0 to 3
        c[i][j] += a[i][k] * b[k][j]
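The same pseudocode, transcribed as runnable C (my transcription; the slide itself stops at the pseudocode):

    /* 4x4 integer matrix multiply: c = a * b */
    void matrix_multiply(const int a[4][4], const int b[4][4], int c[4][4]) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                c[i][j] = 0;
                for (int k = 0; k < 4; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
    }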

7 Single Loop Schedule (ILP)
1. t1 = a[i][k]  #  t2 = b[k][j]   (the two operations issue in the same cycle, on different functional units)
2. nop
3. nop
4. nop
5. t3 = t1 * t2
6. t0 += t3
--- t0 = c[i][j] before the loop and c[i][j] = t0 after the loop

8 Software Pipelining Can “cover” any latency, removing nops from single-loop schedule IFF conditions are “right.” They are right for matrix multiply so, …

9 Software Pipelined Matrix Mult
All the operations can be included in a single cycle, speeding up the loop by a factor of 7.
Kernel: t1 = a[i][k], t2 = b[k][j], t3 = t1[-5] * t2[-5], t0 += t3
(t1[-5] and t2[-5] denote the values loaded five iterations earlier, matching the 5-cycle load latency)
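One way to picture the transformation in C (a simplified sketch with a one-iteration overlap rather than the slide's five-cycle depth; the dot-product shape and names are mine): loads for the next iteration are issued while the multiply-accumulate for the current one completes.

    /* prologue / kernel / epilogue structure of a software-pipelined loop; assumes n >= 1 */
    int dot(const int *a, const int *b, int n) {
        int t0 = 0;
        int t1 = a[0], t2 = b[0];              /* prologue: first loads in flight      */
        for (int i = 0; i < n - 1; i++) {
            int n1 = a[i + 1], n2 = b[i + 1];  /* kernel: issue next iteration's loads */
            t0 += t1 * t2;                     /* ...while this iteration's MAC runs   */
            t1 = n1; t2 = n2;
        }
        t0 += t1 * t2;                         /* epilogue: drain the pipeline */
        return t0;
    }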

10 Improved Software Pipelining?
Unroll-and-jam on nested loops can significantly shorten the execution time (see the sketch below)
Use of a cache-reuse model can give better schedules than assuming all cache accesses are hits, and can reduce register requirements compared with assuming all accesses are cache misses
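A sketch of unroll-and-jam applied to the matrix-multiply nest from slide 6 (illustrative only, not the transformation pass measured on the next slide): the j loop is unrolled by two and the resulting copies of the k loop are jammed together, so each a[i][k] load feeds two multiply-accumulates.

    void matmul_unroll_and_jam(const int a[4][4], const int b[4][4], int c[4][4]) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j += 2) {   /* j unrolled by 2 (4 is even)       */
                int c0 = 0, c1 = 0;
                for (int k = 0; k < 4; k++) {  /* the two k loops, jammed together  */
                    int aik = a[i][k];         /* one load now reused for j and j+1 */
                    c0 += aik * b[k][j];
                    c1 += aik * b[k][j + 1];
                }
                c[i][j]     = c0;
                c[i][j + 1] = c1;
            }
    }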

11 Results of Software Pipelining Improvements
Using unroll-and-jam on 26 FORTRAN nested loops before performing modulo scheduling led to:
– Decreased loop execution time by up to 94.2%; on average, execution time decreased by 56.9%
– Greatly increased register requirements, often by a factor of 5

12 Results of Software Pipelining Improvements
Using a simple cache-reuse model, our modulo scheduler:
– Improved execution time by roughly 11% over an all-hit assumption, with little change in register usage
– Used 17.9% fewer registers than an all-miss assumption, while generating 8% slower code

13 “OMAP” Resources
[Slide diagram of the OMAP platform components: Chiron, Tesla, Ducati, Multi-CPU, Shared Memory, FPGA]

14 Optimizing Compilers for Modern Architectures
Syllabus
Allen and Kennedy, Preface

15 Dependence-Based Compilation
Vectorization and parallelization require a deeper analysis than optimization for scalar machines
– Must be able to determine whether two accesses to the same array might be to the same location
Dependence is the theory that makes this possible
– There is a dependence between two statements if they might access the same location, there is a path from one to the other, and one access is a write
Dependence has other applications
– Memory hierarchy management: restructuring programs to make better use of cache and registers (includes input dependences)
– Scheduling of instructions
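A small C illustration of the definition above (my example, not from the slides):

    void example(int a[], int b[], int n) {
        for (int i = 1; i < n; i++) {
            a[i] = a[i - 1] + 1;   /* loop-carried true dependence: a[i-1] was written
                                      by the previous iteration, so the iterations
                                      cannot simply run in parallel                   */
            b[i] = 2 * b[i];       /* each iteration touches only b[i]: no dependence
                                      between iterations, so this statement vectorizes */
        }
    }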

16 Syllabus I
Introduction – Parallel and vector architectures. The problem of parallel programming. Bernstein's conditions and the role of dependence. Compilation for parallel machines and automatic detection of parallelism.
Dependence Theory and Practice – Fundamentals, types of dependences. Testing for dependence: separable, gcd and Banerjee tests. Exact dependence testing. Construction of direction and distance vectors.
Preliminary Transformations – Loop normalization, scalar data flow analysis, induction variable substitution, scalar renaming.
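As a hedged sketch of the gcd test named above (an illustrative helper, not course code): for two references with subscripts a1*i + a0 and b1*j + b0 into the same array, a dependence can exist only if gcd(a1, b1) divides b0 - a0.

    #include <stdlib.h>

    static int gcd(int x, int y) {
        while (y != 0) { int t = x % y; x = y; y = t; }
        return abs(x);
    }

    /* Returns 1 if a dependence is still possible, 0 if the gcd test rules it out. */
    int gcd_test(int a1, int a0, int b1, int b0) {
        int g = gcd(a1, b1);
        if (g == 0)                 /* both subscripts are loop-invariant constants */
            return a0 == b0;
        return (b0 - a0) % g == 0;
    }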

17 Syllabus II
Fine-Grain Parallel Code Generation – Loop distribution and its safety. The Kuck vectorization principle. The layered vector code-generation algorithm and its complexity. Loop interchange.
Coarse-Grain Parallel Code Generation – Loop interchange. Loop skewing. Scalar and array expansion. Forward substitution. Alignment. Code replication. Array renaming. Node splitting. Pattern recognition. Threshold analysis. Symbolic dependence tests. Parallel code generation and its problems.
Control Dependence – Types of branches. If conversion. Control dependence. Program dependence graph.
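One of the transformations listed above, sketched in C (my example, assuming a row-major array and that the dependence directions permit the interchange):

    #define N 256

    /* before: j outer, i inner -> each access strides N doubles through memory */
    void add_col_order(double a[N][N], const double b[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += b[i][j];
    }

    /* after interchange: i outer, j inner -> unit-stride, cache-friendly accesses */
    void add_row_order(double a[N][N], const double b[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += b[i][j];
    }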

18 Syllabus III
Memory Hierarchy Management – The use of dependence in scalar register allocation and management of the cache memory hierarchy.
Scheduling for Superscalar and Parallel Machines – Role of dependence. List Scheduling. Software Pipelining. Work scheduling for parallel systems. Guided Self-Scheduling.
Interprocedural Analysis and Optimization – Side effect analysis, constant propagation and alias analysis. Flow-insensitive and flow-sensitive problems. Side effects to arrays. Inline substitution, linkage tailoring and procedure cloning. Management of interprocedural analysis and optimization.
Compilation of Other Languages – C, Verilog, Fortran 90, HPF.

19 What is High Performance Computing?
What architectural models are there?
What system software is required? Standard?
How should we evaluate high performance?
– Run time?
– Run time x machine cost?
– Speedup?
– Efficient use of CPU resources?
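For the evaluation questions above, the usual definitions (standard formulas, with made-up numbers for illustration): speedup S(p) = T(1)/T(p), efficiency E(p) = S(p)/p, and a cost-weighted metric of run time times machine cost.

    #include <stdio.h>

    int main(void) {
        double t_serial   = 100.0;  /* assumed serial run time, seconds */
        double t_parallel = 8.0;    /* assumed run time on p processors */
        int    p          = 16;
        double machine_cost = 4.0;  /* assumed cost rate per processor  */

        double speedup    = t_serial / t_parallel;          /* S(p) = T(1)/T(p)        */
        double efficiency = speedup / p;                     /* E(p) = S(p)/p           */
        double cost       = t_parallel * p * machine_cost;   /* run time x machine cost */

        printf("speedup %.2f  efficiency %.2f  cost %.1f\n", speedup, efficiency, cost);
        return 0;
    }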

