Published by Claude Burns. Modified over 9 years ago.
1
OPTIMIZING LU FACTORIZATION IN CILK++
Nathan Beckmann, Silas Boyd-Wickizer
2
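THE PROBLEM
LU is a common matrix operation with a broad range of applications
Writes a matrix as a product of L and U
Example: PA = LU

\[
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
=
\begin{pmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{pmatrix}
\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}
\]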
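To make the PA = LU example concrete, here is a minimal sketch of unblocked LU with partial pivoting in pure Python. It is not the presenters' code. It assumes the unit-diagonal convention for L (the slide shows a general diagonal), and the names `lu_partial_pivot` and `matmul` are illustrative. The sample matrix is chosen so that the pivoting step swaps the first two rows, which matches the permutation matrix P shown on the slide.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P*A = L*U.

    Returns (perm, lower, upper), where row i of P*A is row perm[i] of A.
    Assumption: L has a unit diagonal (Doolittle convention).
    """
    n = len(a)
    lu = [list(map(float, row)) for row in a]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        # Eliminate below the pivot and update the trailing submatrix.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    lower = [[lu[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    upper = [[lu[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return perm, lower, upper


def matmul(x, y):
    """Plain triple-loop matrix product for small checks."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


A = [[0.0, 2.0, 1.0],
     [1.0, 1.0, 0.0],
     [0.0, 1.0, 3.0]]
perm, L, U = lu_partial_pivot(A)
PA = [A[i] for i in perm]  # apply the row permutation to A
```

After running the sketch, `matmul(L, U)` should reproduce `PA` up to rounding.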
3
THE PROBLEM
6
[Figure labels: Small parallelism; Big parallelism]
7
OUTLINE
Overview
Results
Conclusion
8
OVERVIEW
Four implementations of LU:
  PLASMA (highly optimized third-party library)
  Sivan Toledo’s algorithm in Cilk++ (courtesy of Bradley)
  Parallel standard “right-looking” in Cilk++
  Right-looking in pthreads
All implementations use the same base case:
  GotoBLAS2 matrix routines
Analyze performance:
  Machine architecture
  Cache behavior
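For readers unfamiliar with the recursive formulation, the sketch below shows the column-splitting scheme commonly attributed to Toledo: factor the left half of the columns, update the right half, then factor the right half. This is a serial pure-Python illustration under stated assumptions, not the Cilk++ implementation discussed in the talk. The function names are hypothetical, L is assumed to have a unit diagonal, and row swaps are applied to whole rows for simplicity. In a parallel version, the two update loops are where most of the independent work would come from.

```python
def lu_recursive(a, perm, c0, c1):
    """Factor columns c0..c1-1 (rows c0 and below) of square matrix `a` in place."""
    m = len(a)
    if c1 - c0 < 1:
        return
    if c1 - c0 == 1:
        # Base case: pivot a single column and scale the entries below the pivot.
        p = max(range(c0, m), key=lambda i: abs(a[i][c0]))
        if a[p][c0] == 0.0:
            raise ValueError("matrix is singular")
        a[c0], a[p] = a[p], a[c0]          # swap whole rows
        perm[c0], perm[p] = perm[p], perm[c0]
        for i in range(c0 + 1, m):
            a[i][c0] /= a[c0][c0]
        return
    cm = (c0 + c1) // 2
    lu_recursive(a, perm, c0, cm)          # factor the left half of the panel
    # Triangular solve: U12 = L11^{-1} * A12 (forward substitution, unit diagonal).
    for k in range(c0, cm):
        for i in range(k + 1, cm):
            for j in range(cm, c1):
                a[i][j] -= a[i][k] * a[k][j]
    # Schur complement update: A22 -= L21 * U12.
    for i in range(cm, m):
        for k in range(c0, cm):
            lik = a[i][k]
            for j in range(cm, c1):
                a[i][j] -= lik * a[k][j]
    lu_recursive(a, perm, cm, c1)          # factor the right half of the panel


def factor(a):
    """Return (perm, lu) with L below the diagonal of lu and U on and above it."""
    lu = [list(map(float, row)) for row in a]
    perm = list(range(len(lu)))
    lu_recursive(lu, perm, 0, len(lu))
    return perm, lu


def residual(a, perm, lu):
    """Largest absolute entry of P*A - L*U, used as a correctness check."""
    n = len(a)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = sum((lu[i][k] if k < i else 1.0) * lu[k][j]
                    for k in range(min(i, j) + 1))
            worst = max(worst, abs(s - a[perm[i]][j]))
    return worst


A = [[2, 1, 1, 0],
     [4, 3, 3, 1],
     [8, 7, 9, 5],
     [6, 7, 9, 8]]
perm, lu = factor(A)
```

Calling `residual(A, perm, lu)` should return a value near zero if the factorization is correct.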
9
OUTLINE
Overview
Results
  Summary
  Architectural heterogeneity
  Cache effects
  Parallelism
  Scheduling
  Code size
Conclusion
10
METHODOLOGY
Machine configurations:
  AMD16: Quad-quad AMD Opteron 8350 @ 2.0 GHz
  Intel16: Quad-quad Intel Xeon E7340 @ 2.0 GHz
  Intel8: Dual-quad Intel Xeon E5530 @ 2.4 GHz
"Xen" indicates running a Xen-enabled kernel
  All tests still ran in dom0 (outside a virtual machine)
11
PERFORMANCE SUMMARY
Significant performance heterogeneity across machine architectures
Large impact from caches
LU performance (GFLOPS on 4k x 4k, 8 cores):

           AMD16   Intel16   Intel16Xen   Intel8Xen
PLASMA     28.7    21.5      20.6         31.1
Toledo     17.2    19.6      17.4         32.5
Right      7.72    8.53      7.38         23.2
Pthread    12.5    11.2      10.8         22.1
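As a reading aid for the GFLOPS figures, the sketch below converts between run time and GFLOPS using the standard leading-order operation count for LU, 2n^3/3. Two assumptions are made here and are not stated on the slide: that "4k" means n = 4096, and that the presenters used this same flop-count convention. The helper names are illustrative.

```python
def lu_flops(n):
    """Leading-order floating-point operation count for LU of an n x n matrix."""
    return 2.0 * n ** 3 / 3.0


def gflops(n, seconds):
    """Achieved rate in GFLOPS for an n x n LU that took `seconds`."""
    return lu_flops(n) / seconds / 1e9


def seconds_at(n, rate_gflops):
    """Run time implied by a given GFLOPS rate (inverse of gflops)."""
    return lu_flops(n) / (rate_gflops * 1e9)


# Assumption: "4k" is read as n = 4096.
n = 4096
plasma_amd16_seconds = seconds_at(n, 28.7)  # roughly 1.6 seconds
right_amd16_seconds = seconds_at(n, 7.72)   # roughly 5.9 seconds
```

Under these assumptions, the gap between PLASMA and Right on AMD16 corresponds to several seconds of wall-clock time per factorization.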
12
LU SCALING
14
ARCHITECTURAL VARIATION (BY ARCH.)
[Figure panels: AMD16, Intel16, Intel8Xen]
15
ARCHITECTURAL VARIATION (BY ALG’THM)
16
XEN INTERFERENCE
Strange behavior with increasing core count on Intel16
[Figure panels: Intel16Xen, Intel16]
18
CACHE INTERFERENCE
Noticed a scaling problem with the Toledo algorithm
Tested with matrices of size 2^n
  Caused conflict misses in the processor cache
19
CACHE INTERFERENCE: EXAMPLE
AMD Opteron has 64-byte cache lines and a 64 KB 2-way set-associative cache:
  512 sets, 2 cache lines each
  Addresses 32 KB apart (4096 doubles) map to the same set
Address bits: offset 0-5, set 6-14, tag 15-63
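The set mapping on this slide can be checked with a little arithmetic. The sketch below is a simplified model, assuming a row-major matrix of 8-byte doubles that starts at address 0 and a cache indexed directly by those addresses. The names `set_index` and `column_sets` are illustrative.

```python
LINE_BYTES = 64      # cache line size from the slide
NUM_SETS = 512       # 64 KB / 64 B per line / 2 ways
DOUBLE_BYTES = 8


def set_index(byte_address):
    """Cache set selected by address bits 6-14."""
    return (byte_address // LINE_BYTES) % NUM_SETS


def column_sets(row_stride_elems, n_rows, base=0):
    """Set indices touched when walking down one column of a row-major matrix."""
    return [set_index(base + r * row_stride_elems * DOUBLE_BYTES)
            for r in range(n_rows)]


# With 4096 doubles per row, consecutive rows are exactly 32 KB apart,
# so every element of a column lands in the same set.
sets_unpadded = column_sets(4096, 64)
distinct_unpadded = len(set(sets_unpadded))
```

In this model `distinct_unpadded` is 1. With only 2 ways per set, at most two lines of a column can be resident at once, which is the conflict-miss pattern the slide describes.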
20
CACHE INTERFERENCE: EXAMPLE
[Figure label: 4096 elements]
27
SOLUTION: PAD MATRIX ROWS
[Figure labels: 4096 elements; 8 element pad]
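The effect of the 8-element pad can be checked the same way. The sketch below is a simplified model, assuming a row-major matrix of 8-byte doubles starting at address 0, 64-byte lines, and 512 sets. The helper names are illustrative. Eight doubles are exactly one cache line, so each padded row begins one set further along than the row before it.

```python
LINE_BYTES = 64
NUM_SETS = 512
DOUBLE_BYTES = 8


def set_index(byte_address):
    """Cache set selected by the address (line number modulo number of sets)."""
    return (byte_address // LINE_BYTES) % NUM_SETS


def column_sets(row_stride_elems, n_rows):
    """Set indices touched when walking down one column of a row-major matrix."""
    return [set_index(r * row_stride_elems * DOUBLE_BYTES) for r in range(n_rows)]


ROW_ELEMS = 4096
PAD_ELEMS = 8  # one 64-byte cache line of padding per row

unpadded = column_sets(ROW_ELEMS, 512)
padded = column_sets(ROW_ELEMS + PAD_ELEMS, 512)

distinct_unpadded = len(set(unpadded))  # every row hits the same set
distinct_padded = len(set(padded))      # rows spread over all sets
```

In this model the padded layout spreads a 512-row column over all 512 sets, at a memory cost of 8/4096, or about 0.2%, per row.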
31
CACHE INTERFERENCE (GRAPHS)
[Figure panels: Before; After]
33
PARALLELISM
Toledo shows higher parallelism, particularly in burdened parallelism and for large matrices
Still doesn’t explain the poor scaling of Right at low numbers of cores

Matrix size   Toledo                      Right-looking
              Parallelism   Burdened      Parallelism   Burdened
2048x2048     15.8          15.5          16.0          12.2
4096x4096     38.1          37.4          34.6          26.0
8192x8192     92.6          91.1          72.8          57.3
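One way to read the table is through the usual work/span bound: speedup on P cores cannot exceed min(P, parallelism). The sketch below applies that bound to the burdened parallelism figures from the slide. Treating burdened parallelism as the limit is a simplification made here, and the names are illustrative.

```python
# Burdened parallelism from the slide, keyed by matrix dimension.
BURDENED = {
    "Toledo": {2048: 15.5, 4096: 37.4, 8192: 91.1},
    "Right":  {2048: 12.2, 4096: 26.0, 8192: 57.3},
}


def speedup_bound(cores, parallelism):
    """Upper bound on speedup: limited by core count and by available parallelism."""
    return min(float(cores), parallelism)


# At 8 cores every configuration has parallelism above the core count,
# so parallelism alone does not limit any of them.
bounds_8 = {alg: [speedup_bound(8, p) for p in sizes.values()]
            for alg, sizes in BURDENED.items()}

# At 16 cores only the 2048x2048 cases fall below the core count.
right_2048_at_16 = speedup_bound(16, BURDENED["Right"][2048])
toledo_8192_at_16 = speedup_bound(16, BURDENED["Toledo"][8192])
```

Under this bound all entries at 8 cores are limited by the core count rather than by parallelism, which is consistent with the slide's remark that parallelism does not explain Right's poor scaling at low core counts.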
34
SYSTEM FACTORS (LOAD LATENCY)
Performance of Right relative to Toledo
35
SYSTEM FACTORS (LOAD LATENCY)
Performance of Tile relative to Toledo
37
SCHEDULING
Cilk++ provides a dynamic scheduler
PLASMA and pthread use a static schedule
Compare performance under a multiprogrammed workload
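A toy model can illustrate why the two scheduling styles behave differently when another program takes part of a core. The sketch below is a deliberately simplified simulation, not a model of the Cilk++ work-stealing scheduler or of PLASMA. It assumes equal-sized independent tasks, per-worker speeds that stay fixed, and a central greedy queue standing in for dynamic scheduling. All names are illustrative.

```python
import heapq


def static_makespan(n_tasks, speeds):
    """Finish time when tasks are split evenly among workers in advance."""
    p = len(speeds)
    shares = [n_tasks // p + (1 if w < n_tasks % p else 0) for w in range(p)]
    return max(share / speed for share, speed in zip(shares, speeds))


def dynamic_makespan(n_tasks, speeds):
    """Finish time when each task goes to whichever worker becomes free first."""
    free_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_tasks):
        t, w = heapq.heappop(free_at)   # earliest available worker
        t += 1.0 / speeds[w]            # one unit of work at that worker's speed
        finish = max(finish, t)
        heapq.heappush(free_at, (t, w))
    return finish


# Four workers; the last one runs at half speed because it shares its core.
speeds = [1.0, 1.0, 1.0, 0.5]
static_time = static_makespan(64, speeds)
dynamic_time = dynamic_makespan(64, speeds)
```

In this model the static split waits for the slowed worker to finish its fixed share, while the dynamic version shifts work to the faster workers and finishes sooner.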
38
SCHEDULING GRAPH
Cilk++ implementations degrade more gracefully
PLASMA does OK; pthread right (“tile”) doesn’t
40
CODE STYLE
Comparing different languages
  Expected a large difference, but they are similar
Complexity is in the base case
  Base cases are shared
Lines of code:

             Toledo   Right-looking   PLASMA   Pthread Right
Just LU      111      121             143      134
Everything   238      257             269      934*

* Includes base case wrappers
41
CONCLUSION
Cilk++ can perform competitively with optimized math libraries
Cache behavior is the most important factor
Cilk++ degrades more gracefully when other workloads are running
  Especially compared to hand-coded pthread versions
Code size is not a major factor