Published by Clarence Parslow. Modified over 9 years ago.
1
Fractal Matrix Multiply
G. Bilardi
– University of Padua, Dipartimento di Elettronica e Informatica
P. D'Alberto and A. Nicolau
– University of California at Irvine, Information and Computer Science
[Figure: C += A * B, with C split into quadrants C0, C1, C2, C3, A into A0, A1, A2, A3, and B into B0, B1, B2, B3]
2
Talk Organization
Motivations
– Alias “why is matrix multiply so popular?”
Why did we jump into the project?
Matrix multiply as it is done
– How we differ
Our approach (performance-related stuff)
– How we did it
– Experimental results
Conclusions
3
Motivations
Matrix multiply as an example
– For data reuse: every element is used for n multiplications
– Space requirements: sizes, layouts
Matrix multiply as a kernel
– Level-3 BLAS applications, e.g. LU decomposition
4
Why did we jump into the project?
Matrix multiply is asymptotically optimal
Cache hierarchy oblivious
– n^3 multiplications and k n^3 misses (k <= 3)
– By Hung-Kung
We can safely study different algorithms:
– Safely: we do not lose optimality
– Different algorithms: computation orders
5
Why did we jump into the project? (cont.)
Optimal use of caches = optimal performance?
– Not really
Performance depends on:
– Register allocation, scheduling, layouts, recursion/no recursion, RISC/non-RISC architecture, compiler optimizations, etc.
We want performance
– MFLOPS
6
Multiplication as it is done
1. Tiling for L1
– Reduction to a single, simple, common problem
– Then L2, L3, ...
2. Register allocation on the simple problem:
– Number of registers
– Non-RISC/RISC (Pentium/non-Pentium)
3. Scheduling by the compiler
4. Feedback, and start over again if necessary
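The tiling step above can be sketched as follows. This is a minimal illustration, not the code of ATLAS or of any other library: the function names `matmul_tiled` and `matmul_naive` and the `tile` parameter are hypothetical, and the tile size would in practice be chosen to fit the L1 cache.

```python
def matmul_naive(A, B, n):
    """Reference product of two n x n matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_tiled(A, B, n, tile=4):
    """Same product, computed tile by tile (tile size is an assumed parameter)."""
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # One tile-sized sub-problem: every tile reduces to this same loop nest.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Tiling only changes the order in which the n^3 multiply-adds are executed, so the result is identical to the untiled product.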
7
ATLAS, for example:
– Tiles fixed in size
– Registers = 4+2+2
– Tiles copied into a contiguous workspace
[Figure: tiles of C, A and B]
8
How do we differ from the others?
We present
– A unique recursive algorithm: the decomposition is a function of the problem size
– A recursive layout (fractal layout, alias Z-Morton)
– Register allocation tuned to the number of registers, for register-file-based architectures; automatically generated
– Optimization of the index computation and recursion
– Scheduling by the compiler
9
Our Approach: Fractal Layout (alias Z-Morton)
If A is a near-square matrix, then A0, A1, A2, A3 are near-square matrices about ¼ the size of A, and A0 is the largest.
Near square: |rows - columns| <= 1
[Figure: a matrix A of side 17 split into quadrants A0, A1, A2, A3 with sides 8 and 9; the layout in memory is sequential]
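A minimal sketch of the layout described above, under stated assumptions: each dimension is split into ceil(n/2) and floor(n/2) so that A0 is the largest quadrant, the quadrants are stored in the order A0, A1, A2, A3, and small blocks are kept in row-major order. The names `split`, `fractal_order`, `to_fractal` and the `leaf` threshold are hypothetical and chosen only for illustration.

```python
def split(n):
    """Split a dimension into (larger, smaller) halves: ceil(n/2), floor(n/2)."""
    return (n + 1) // 2, n // 2


def fractal_order(r0, c0, rows, cols, leaf=2):
    """Yield the (row, col) coordinates of a block in fractal order.

    Blocks whose smaller side is at most `leaf` are emitted in row-major order.
    """
    if rows == 0 or cols == 0:
        return
    if min(rows, cols) <= leaf:
        for r in range(r0, r0 + rows):
            for c in range(c0, c0 + cols):
                yield (r, c)
        return
    top, bottom = split(rows)
    left, right = split(cols)
    yield from fractal_order(r0, c0, top, left, leaf)                   # A0
    yield from fractal_order(r0, c0 + left, top, right, leaf)           # A1 (assumed order)
    yield from fractal_order(r0 + top, c0, bottom, left, leaf)          # A2
    yield from fractal_order(r0 + top, c0 + left, bottom, right, leaf)  # A3


def to_fractal(M):
    """Copy a row-major list of rows into one flat list in fractal order."""
    return [M[r][c] for r, c in fractal_order(0, 0, len(M), len(M[0]))]
```

With this split a side of 17 becomes 9 and 8, so each quadrant of a near-square matrix is again near square, and every quadrant occupies one contiguous range of the flat list.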
10
Our Approach
1. A square problem is decomposed into 8 near-square problems of size between [lower bound missing in the source] and [upper bound missing in the source]
2. Each sub-problem has its operands stored contiguously
– Thanks to the recursive layout
3. A sub-problem is decomposed if min(k,j,l) > 32
4. Otherwise it is solved directly
– The operands are in row-major format
– Optimized at the register-file level
– Reuse of common optimizations
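The recursion in steps 1, 3 and 4 might look like the sketch below. For brevity it works on ordinary row-major lists of rows rather than on the recursive layout, so it illustrates only the decomposition rule and the cutoff, not the contiguous storage of operands. The names `fractal_mm` and `split` and the argument order are assumptions.

```python
def split(n):
    """Split a dimension into (larger, smaller) halves: ceil(n/2), floor(n/2)."""
    return (n + 1) // 2, n // 2


def fractal_mm(C, A, B, i0, j0, k0, m, n, p, cutoff=32):
    """C[i0:i0+m][j0:j0+n] += A[i0:i0+m][k0:k0+p] * B[k0:k0+p][j0:j0+n]."""
    if min(m, n, p) <= cutoff:
        # Small problem: solve directly with a plain loop nest.
        for i in range(i0, i0 + m):
            for k in range(k0, k0 + p):
                a = A[i][k]
                for j in range(j0, j0 + n):
                    C[i][j] += a * B[k][j]
        return
    (m1, m2), (n1, n2), (p1, p2) = split(m), split(n), split(p)
    # Eight near-square sub-problems: two halves along each of the three dimensions.
    for di, mi in ((0, m1), (m1, m2)):
        for dj, nj in ((0, n1), (n1, n2)):
            for dk, pk in ((0, p1), (p1, p2)):
                fractal_mm(C, A, B, i0 + di, j0 + dj, k0 + dk, mi, nj, pk, cutoff)
```

A call such as `fractal_mm(C, A, B, 0, 0, 0, n, n, n)` accumulates the full product into a zero-initialized `C`.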
11
Our Approach (cont.): The Type DAG
A recursion tree for a problem has O(8 log n) different types
The type determines the index computation for the sub-problems
The types and the matrix offsets are determined and stored in a tree-like structure, the “type DAG”
Index computations are reduced by 30%
– With moderate extra space
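To see why the number of types stays small, the sketch below counts the distinct sub-problem shapes in the recursion and compares that with the number of nodes of the full recursion tree. It assumes that a "type" is simply the triple of sub-problem dimensions and that each dimension is split into ceil(n/2) and floor(n/2); the authors' notion of type may carry more information. The names `distinct_types` and `tree_nodes` are hypothetical.

```python
from functools import lru_cache


def split(n):
    """Split a dimension into (larger, smaller) halves: ceil(n/2), floor(n/2)."""
    return (n + 1) // 2, n // 2


def children(shape):
    """The eight sub-problem shapes of an (m, n, p) problem."""
    m, n, p = shape
    return [(a, b, c) for a in split(m) for b in split(n) for c in split(p)]


def distinct_types(n, cutoff=32):
    """Set of distinct shapes reachable from an n x n x n problem."""
    seen, stack = set(), [(n, n, n)]
    while stack:
        shape = stack.pop()
        if shape in seen:
            continue
        seen.add(shape)
        if min(shape) > cutoff:  # same decomposition rule as the recursion
            stack.extend(children(shape))
    return seen


@lru_cache(maxsize=None)
def _nodes(shape, cutoff):
    if min(shape) <= cutoff:
        return 1
    return 1 + sum(_nodes(child, cutoff) for child in children(shape))


def tree_nodes(n, cutoff=32):
    """Number of calls in the full recursion tree."""
    return _nodes((n, n, n), cutoff)
```

Each level of the recursion has at most two distinct sizes per dimension, hence at most 8 shapes per level, while the tree itself grows by a factor of 8 per level.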
12
Recursive Tree and Type DAG
[Figure: recursion tree whose eight sub-problems are]
C0 += A0 B0
C0 += A1 B2
C1 += A1 B1
C1 += A0 B3
C3 += A3 B3
C3 += A2 B1
C2 += A2 B2
C2 += A3 B0
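Taking the block labels on this slide at face value, and assuming that the order in which they appear is the order of execution (the slide text does not say so), one can check how many operand blocks two consecutive sub-problems have in common. The names `SCHEDULE` and `shared_operands` are hypothetical.

```python
# The eight sub-problems as (C block, A block, B block), in slide order.
SCHEDULE = [
    ("C0", "A0", "B0"), ("C0", "A1", "B2"),
    ("C1", "A1", "B1"), ("C1", "A0", "B3"),
    ("C3", "A3", "B3"), ("C3", "A2", "B1"),
    ("C2", "A2", "B2"), ("C2", "A3", "B0"),
]


def shared_operands(schedule):
    """For each pair of consecutive steps, the set of blocks they both use."""
    return [set(a) & set(b) for a, b in zip(schedule, schedule[1:])]
```

In this order every step shares exactly one block with the step before it, which is the kind of operand reuse a computation order can be chosen for.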
13
Our Approach (cont.): Register Allocation
When the recursion stops:
1. Sub-problems smaller than [size missing in the source] are computed directly
2. Sub-matrices smaller than 32 by 32 are stored in row-major format
3. Register allocation
   1. Fractal register allocation
   2. C-tiling register allocation
14
Register Allocation, Fractal
We applied the recursive decomposition at the register level
– We balance the distribution of registers across the matrices
Advantage:
– The register file is treated as L0
Disadvantage:
– The computation is expressed as straight-line code, which causes code explosion
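The code-explosion point can be illustrated with a small generator that applies the same recursive decomposition all the way down to scalars and emits one statement per multiply-add. This is only a sketch: it does not model register assignment, it emits Python statements rather than the code the authors generated, and the name `straight_line` is hypothetical.

```python
def split(n):
    """Split a dimension into (larger, smaller) halves: ceil(n/2), floor(n/2)."""
    return (n + 1) // 2, n // 2


def straight_line(m, n, p, i0=0, j0=0, k0=0, out=None):
    """Return straight-line statements computing c += a * b for an m x n x p problem."""
    if out is None:
        out = []
    if m == 0 or n == 0 or p == 0:
        return out  # empty half of an odd split
    if m == n == p == 1:
        out.append(f"c[{i0}][{j0}] += a[{i0}][{k0}] * b[{k0}][{j0}]")
        return out
    (m1, m2), (n1, n2), (p1, p2) = split(m), split(n), split(p)
    for di, mi in ((0, m1), (m1, m2)):
        for dj, nj in ((0, n1), (n1, n2)):
            for dk, pk in ((0, p1), (p1, p2)):
                straight_line(mi, nj, pk, i0 + di, j0 + dj, k0 + dk, out)
    return out
```

The number of statements is m * n * p, so doubling the side of the directly solved block multiplies the code size by eight.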
15
Register Allocation, C-tiling
No balanced distribution of the registers R
– s^2 registers for C, s for A and s for B (s^2 + 2s registers in total)
C is tiled further into s x s sub-squares, and for each of them
– The s x s square of the C tile is loaded into registers
1. An s x 1 column of the A tile is loaded into registers
2. A 1 x s row of the B tile is loaded into registers
3. Scalar product
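The C-tiling scheme can be sketched as below, with local Python variables standing in for machine registers. The names `c_tiling_mm` and `registers_needed` are hypothetical, and the handling of edge tiles smaller than s x s is an assumption added so that the sketch works for any size.

```python
def registers_needed(s):
    """Registers used by the scheme: s*s for C, s for A, s for B."""
    return s * s + 2 * s


def c_tiling_mm(C, A, B, m, n, p, s=2):
    """C += A * B (C is m x n, A is m x p, B is p x n), one s x s tile of C at a time."""
    for i0 in range(0, m, s):
        for j0 in range(0, n, s):
            h, w = min(s, m - i0), min(s, n - j0)
            # Load the s x s square of C into "registers".
            creg = [[C[i0 + i][j0 + j] for j in range(w)] for i in range(h)]
            for k in range(p):
                areg = [A[i0 + i][k] for i in range(h)]  # s x 1 column of A
                breg = [B[k][j0 + j] for j in range(w)]  # 1 x s row of B
                for i in range(h):
                    for j in range(w):
                        creg[i][j] += areg[i] * breg[j]
            # Store the tile back only once, after all p updates.
            for i in range(h):
                for j in range(w):
                    C[i0 + i][j0 + j] = creg[i][j]
```

For s = 2 the count is 4 + 2 + 2 = 8 registers. Each element of C is loaded and stored once per tile, while each step of the inner loop loads 2s values to perform s^2 multiply-adds.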
16
C-tiling (cont.)
Advantage: more efficient than fractal, reducing loads and stores
Disadvantage: the register file is considered differently
[Figure: tiles of C, A and B]
17
Cache Performance ULTRA5
18
Cache Performance SPARC5
19
MFLOPS Performance Pentium II
20
MFLOPS R5K_ip32 ultra2
21
Conclusions
Algorithms exploit the cache hierarchy without taking cache parameters into account
Performance is achieved by optimizing the recursion:
– Careful pruning
– Index computation optimization
We used the matrix multiply:
– For LU decomposition
– Further improving the performance
22
Thank you