Fractal Matrix Multiply
G. Bilardi – University of Padua, Dipartimento di Elettronica e Informatica
P. D’Alberto and A. Nicolau – University of California at Irvine, Information and Computer Science
[Figure: C += A * B, with each matrix split into quadrants: C into C0, C1, C2, C3; A into A0, A1, A2, A3; B into B0, B1, B2, B3]
Talk Organization
Motivations
  – Alias "why is matrix multiply so popular?"
Why did we jump into the project?
Matrix multiply as it is done
  – How we differ
Our approach (performance-related stuff)
  – How we did it
  – Experimental results
Conclusions
Motivations
Matrix multiply as an example
  – Data reuse: every element is used in n multiplications
  – Space requirements: sizes, layouts
Matrix multiply as a kernel
  – Level-3 BLAS applications, e.g. LU decomposition
Why did we jump into the project?
Matrix multiply is asymptotically optimal and cache hierarchy oblivious
  – n^3 multiplications and kn^3 misses (k <= 3)
  – By Hong and Kung
We can safely study different algorithms:
  – Safely: we do not lose optimality
  – Different algorithms: different computation orders
Why did we jump into the project? (cont.)
Optimal use of caches = optimal performance?
  – Not really
Performance depends on:
  – Register allocation, scheduling, layouts, recursion or no recursion, RISC or non-RISC architecture, compiler optimizations, etc.
We want performance
  – MFLOPS
Multiplication as it is done
1. Tiling for L1
  – Reduction to a single, simple, common problem
  – Then L2, L3, ...
2. Register allocation on the simple problem:
  – Number of registers
  – Non-RISC vs. RISC (Pentium vs. non-Pentium)
3. Scheduling by the compiler
4. Feedback, and start over again if necessary
ATLAS, for example
[Figure: matrices C, A and B, with the labels "Tiles fixed in size", "Registers = Tiles" and "Copied in a contiguous workspace"]
How do we differ from the others?
We present
  – A single recursive algorithm: the decomposition is a function of the problem size
  – A recursive layout (fractal layout, alias Z-Morton)
  – Register allocation tuned to the number of registers, for register-file-based architectures; automatically generated
  – Optimization of the index computation and of the recursion
  – Scheduling by the compiler
Our Approach: Fractal Layout (alias Z-Morton)
If A is a near-square matrix, then A0, A1, A2 and A3 are near-square matrices, each about ¼ the size of A, and A0 is the largest.
Near square: |rows - columns| <= 1
[Figure: A split into quadrants A0, A1, A2, A3, which are laid out sequentially in memory]
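A minimal Python sketch of such a layout, not the authors' code. It assumes the quadrants are numbered row by row (A0 top-left, A1 top-right, A2 bottom-left, A3 bottom-right) and stored one after another in that order. The names `split`, `to_fractal`, `fractal_layout` and the `leaf` cutoff are hypothetical.

```python
def split(n):
    """Split a dimension into two near-equal parts, larger part first."""
    return (n + 1) // 2, n // 2


def to_fractal(a, r0, c0, rows, cols, leaf, out):
    """Append the rows x cols block of `a` starting at (r0, c0) to `out`
    in recursive (Z-Morton-like) order. Blocks whose smaller side is at
    most `leaf` (leaf >= 1) are copied in row-major order."""
    if min(rows, cols) <= leaf:
        for i in range(r0, r0 + rows):
            out.extend(a[i][c0:c0 + cols])
        return
    top, bottom = split(rows)
    left, right = split(cols)
    to_fractal(a, r0, c0, top, left, leaf, out)                   # A0 (largest)
    to_fractal(a, r0, c0 + left, top, right, leaf, out)           # A1
    to_fractal(a, r0 + top, c0, bottom, left, leaf, out)          # A2
    to_fractal(a, r0 + top, c0 + left, bottom, right, leaf, out)  # A3


def fractal_layout(a, leaf=1):
    """Return the elements of the row-major matrix `a` as one flat list."""
    out = []
    to_fractal(a, 0, 0, len(a), len(a[0]), leaf, out)
    return out
```

Because every quadrant occupies one contiguous run of the flat list, any sub-matrix reached by the recursion can be addressed with a single offset and a length.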
Our Approach
1. A square problem of size n is decomposed into 8 near-square problems whose dimensions lie between floor(n/2) and ceil(n/2)
2. Each sub-problem has its operands stored contiguously
  – Thanks to the recursive layout
3. A sub-problem is decomposed if min(k,j,l) > 32
4. Otherwise it is solved directly
  – The operands are in row-major format
  – Optimized at register-file level
Reuse of common optimizations
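The recursion can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the operands are plain row-major lists of lists addressed by index ranges rather than by the recursive layout, the order of the eight recursive calls is arbitrary, and the names `split`, `halves`, `leaf_multiply`, `fractal_multiply` and the dimension names m, n, k are hypothetical.

```python
def split(n):
    """Split a dimension into two near-equal parts, larger part first."""
    return (n + 1) // 2, n // 2


def halves(start, length):
    """Return the (start, length) pairs of the two halves of a range."""
    big, small = split(length)
    return [(start, big), (start + big, small)]


def leaf_multiply(c, a, b, i0, j0, p0, m, n, k):
    """Direct triple loop on a small block: C += A * B."""
    for i in range(i0, i0 + m):
        for p in range(p0, p0 + k):
            a_ip = a[i][p]
            for j in range(j0, j0 + n):
                c[i][j] += a_ip * b[p][j]


def fractal_multiply(c, a, b, i0, j0, p0, m, n, k, cutoff=32):
    """C[i0:i0+m, j0:j0+n] += A[i0:i0+m, p0:p0+k] * B[p0:p0+k, j0:j0+n].
    The problem is split into 8 sub-problems while every dimension
    exceeds `cutoff` (cutoff >= 1); otherwise it is solved directly."""
    if min(m, n, k) <= cutoff:
        leaf_multiply(c, a, b, i0, j0, p0, m, n, k)
        return
    for ii, mm in halves(i0, m):
        for jj, nn in halves(j0, n):
            for pp, kk in halves(p0, k):
                fractal_multiply(c, a, b, ii, jj, pp, mm, nn, kk, cutoff)
```

A call such as `fractal_multiply(c, a, b, 0, 0, 0, m, n, k)` accumulates the full product into `c`. The cutoff never appears as a cache parameter, only as the point where the recursion is pruned.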
Our Approach, cont.: The Type DAG
The recursion tree for a problem has O(8 log n) different types
The type determines the index computation for the sub-problems
The types and the matrix offsets are determined and stored in a tree-like structure, the "type DAG"
Reduction of index computations by 30%
  – With moderate extra space
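A small Python sketch of why so few types arise. It assumes, as a simplification, that a "type" is just the triple of sub-problem dimensions and that each dimension is split into ceiling and floor halves; the real type DAG also stores matrix offsets. The names `split` and `types_per_level` are hypothetical.

```python
def split(n):
    """Split a dimension into two near-equal parts, larger part first."""
    return (n + 1) // 2, n // 2


def types_per_level(m, k, n, cutoff=32):
    """Return, for each recursion level, the set of distinct (m, k, n)
    sub-problem shapes. A shape is split only if min(shape) > cutoff."""
    level = {(m, k, n)}
    levels = []
    while level:
        levels.append(level)
        nxt = set()
        for (a, b, c) in level:
            if min(a, b, c) > cutoff:
                for a2 in split(a):
                    for b2 in split(b):
                        for c2 in split(c):
                            nxt.add((a2, b2, c2))
        level = nxt
    return levels
```

After repeated halving, each dimension takes at most two values per level, so there are at most 2 * 2 * 2 = 8 shapes per level, while the number of nodes in the recursion tree grows by a factor of 8 at every level.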
Recursive Tree and Type DAG
C0 += A0B0   C0 += A1B2
C1 += A1B3   C1 += A0B1
C3 += A3B3   C3 += A2B1
C2 += A2B0   C2 += A3B2
Our Approach, cont.: Register Allocation
When the recursion stops:
1. Sub-problems that are not decomposed further (min(k,j,l) <= 32) are computed directly
2. Sub-matrices smaller than 32 by 32 are stored in row-major format
3. Register allocation
  1. Fractal register allocation
  2. C-tiling register allocation
Register Allocation, Fractal
We applied the recursive decomposition at register level
  – We balance the distribution of registers among the matrices
Advantage:
  – The register file is treated as L0
Disadvantage:
  – The computation is expressed as straight-line code, which leads to code explosion
Register Allocation, C-tiling
No balanced distribution of the R registers
  – s^2 registers for C, s for A and s for B (2s + s^2 registers in total)
C is tiled further into s x s sub-squares, and for each of them
  – The s x s square of the C tile is loaded into registers
    1. An s x 1 column of the A tile is loaded into registers
    2. A 1 x s row of the B tile is loaded into registers
    3. Scalar product
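The C-tiling steps above can be sketched in Python, with local lists standing in for registers. This is an illustration only: it assumes the row and column counts of C are multiples of s, and the names `c_tiled_multiply`, `registers_needed`, `c_reg`, `a_reg` and `b_reg` are hypothetical. Generated code would fully unroll the inner loops so that each list element maps to one register.

```python
def registers_needed(s):
    """s*s registers for C plus s for A and s for B."""
    return s * s + 2 * s


def c_tiled_multiply(c, a, b, s):
    """C += A * B, computed one s x s tile of C at a time."""
    m, k, n = len(a), len(b), len(b[0])
    assert m % s == 0 and n % s == 0, "sketch assumes dimensions divisible by s"
    for i0 in range(0, m, s):
        for j0 in range(0, n, s):
            # Load the s x s square of C into "registers".
            c_reg = [[c[i0 + i][j0 + j] for j in range(s)] for i in range(s)]
            for p in range(k):
                a_reg = [a[i0 + i][p] for i in range(s)]  # s x 1 column of A
                b_reg = [b[p][j0 + j] for j in range(s)]  # 1 x s row of B
                for i in range(s):
                    for j in range(s):
                        c_reg[i][j] += a_reg[i] * b_reg[j]
            # Store the finished tile back to memory.
            for i in range(s):
                for j in range(s):
                    c[i0 + i][j0 + j] = c_reg[i][j]
```

Each C tile is loaded and stored once, while every step over p loads 2s values and performs s^2 multiply-adds, which is where the saving in loads and stores comes from.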
C-tiling, cont.
Advantage: more efficient than Fractal, since it reduces loads and stores
Disadvantage: the register file is treated differently
[Figure: matrices C, A and B]
Cache Performance ULTRA5
Cache Performance SPARC5
MFLOPS Performance Pentium II
MFLOPS Performance: R5K_ip32, ultra2
Conclusions
Algorithms can exploit the cache hierarchy without taking cache parameters into account
Performance is achieved by optimizing the recursion:
  – Careful pruning
  – Index computation optimization
We used the matrix multiply:
  – For LU decomposition
Further improving the performance
Thank you