Optimizing LU Factorization in Cilk++
Nathan Beckmann, Silas Boyd-Wickizer


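The Problem
LU is a common matrix operation with a broad range of applications.
It writes a matrix as a product of L (lower triangular) and U (upper triangular).
Example: PA = LU

$$
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
=
\begin{pmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{pmatrix}
\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}
$$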
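To make the PA = LU example concrete, here is a minimal serial sketch of right-looking LU with partial pivoting. It is not the code from the talk: the `Matrix` alias, the `lu_factor` name, and the choice of a unit diagonal for L are assumptions made for illustration, and none of the Cilk++ parallelism or the GotoBLAS2 base case is shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Dense row-major n x n matrix; a hypothetical helper type for this sketch.
using Matrix = std::vector<std::vector<double>>;

// Factor A in place so that P*A = L*U.
// On return the strict lower triangle of A holds L (unit diagonal implied),
// the upper triangle holds U, and perm[i] is the original row now in row i.
// Returns false if no nonzero pivot exists (singular matrix).
bool lu_factor(Matrix& A, std::vector<std::size_t>& perm) {
    const std::size_t n = A.size();
    for (const auto& row : A) assert(row.size() == n);  // must be square
    perm.resize(n);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    for (std::size_t k = 0; k < n; ++k) {
        // Partial pivoting: pick the largest-magnitude entry in column k.
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(A[i][k]) > std::fabs(A[p][k])) p = i;
        if (A[p][k] == 0.0) return false;
        std::swap(A[k], A[p]);
        std::swap(perm[k], perm[p]);
        // Right-looking step: scale column k, then update the trailing block.
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];
            for (std::size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];
        }
    }
    return true;
}
```

Multiplying the stored L and U back together should reproduce the rows of the original matrix in the order given by `perm`. The trailing-block update in the inner loops is where a parallel version would find most of its work.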

The Problem
Figure labels: Small parallelism, Big parallelism

Outline: Overview (current section), Results, Conclusion

Overview
Four implementations of LU:
  PLASMA (highly optimized third party library)
  Sivan Toledo's algorithm in Cilk++ (courtesy of Bradley)
  Parallel standard "right-looking" in Cilk++
  Right-looking in pthreads
All implementations use the same base case:
  GotoBLAS2 matrix routines
Analyze performance:
  Machine architecture
  Cache behavior

Outline: Overview; Results (Summary [current section], Architectural heterogeneity, Cache effects, Parallelism, Scheduling, Code size); Conclusion

Methodology
Machine configurations:
  AMD16: Quad-quad AMD Opteron 8350 @ 2.0 GHz
  Intel16: Quad-quad Intel Xeon E7340 @ 2.0 GHz
  Intel8: Dual-quad Intel Xeon E5530 @ 2.4 GHz
"Xen" indicates running a Xen-enabled kernel.
All tests still ran in dom0 (outside virtual machine).

Performance Summary
Quite significant performance heterogeneity by machine architecture.
Large impact from caches.

LU performance (gflops on 4k x 4k, 8 cores)

           AMD16   Intel16   Intel16Xen   Intel8Xen
PLASMA     28.7    21.5      20.6         31.1
Toledo     17.2    19.6      17.4         32.5
Right      7.72    8.53      7.38         23.2
Pthread    12.5    11.2      10.8         22.1
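For readers who want to relate the gflops figures to run time, the sketch below converts an elapsed time into gflops. It assumes the common convention of counting roughly (2/3) n^3 floating-point operations for an n x n LU factorization; the slides do not say which operation count was used, so treat both the formula and the function names as assumptions.

```cpp
#include <cassert>

// Approximate operation count for LU of an n x n matrix.
// Assumption: only the leading (2/3) n^3 term is counted.
double lu_flops(double n) { return 2.0 * n * n * n / 3.0; }

// Convert an elapsed time in seconds into gflops for an n x n factorization.
double lu_gflops(double n, double seconds) {
    assert(seconds > 0.0);
    return lu_flops(n) / seconds / 1e9;
}
```

Under this convention a 4096 x 4096 factorization is about 4.6e10 operations, so a rate near 28.7 gflops would correspond to a run time of roughly 1.6 seconds.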

LU Scaling (figure)

Outline: Overview; Results (Summary, Architectural heterogeneity [current section], Cache effects, Parallelism, Scheduling, Code size); Conclusion

Architectural Variation (by arch.)
Figure panels: AMD16, Intel16, Intel8Xen

Architectural Variation (by algorithm) (figure)

Xen Interference
Strange behavior with increasing core count on Intel16.
Figure panels: Intel16Xen, Intel16

Outline: Overview; Results (Summary, Architectural heterogeneity, Cache effects [current section], Parallelism, Scheduling, Code size); Conclusion

Cache Interference
Noticed scaling problem with Toledo algorithm.
Tested with matrices of size 2^n.
Caused conflict misses in processor cache.

Cache Interference: Example
AMD Opteron has 64-byte cache lines and a 64 Kbyte 2-way set associative cache: 512 sets, 2 cache lines each.
Addresses that are 32 Kbyte (or 4096 doubles) apart map to the same set.
Address layout: offset = bits 0-5, set = bits 6-14, tag = bits 15-63.
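The address split above can be checked with a few lines of arithmetic. This sketch uses only the cache geometry stated on the slide; the constant and function names are hypothetical, and it ignores any distinction between virtual and physical addresses.

```cpp
#include <cassert>
#include <cstdint>

// Cache geometry from the slide: 64-byte lines, 64 KB, 2-way set associative.
constexpr std::uint64_t kLineBytes = 64;
constexpr std::uint64_t kCacheBytes = 64 * 1024;
constexpr std::uint64_t kWays = 2;
constexpr std::uint64_t kSets = kCacheBytes / (kLineBytes * kWays);

static_assert(kSets == 512, "512 sets of 2 lines each");

// Set index of a byte address: drop the 6 offset bits, keep the next 9 bits.
constexpr std::uint64_t cache_set(std::uint64_t addr) {
    return (addr / kLineBytes) % kSets;
}

// Tag of a byte address: everything above the offset and set bits.
constexpr std::uint64_t cache_tag(std::uint64_t addr) {
    return addr / (kLineBytes * kSets);
}
```

Two addresses that differ by a multiple of 4096 * sizeof(double) = 32768 bytes get the same `cache_set` value but different tags, so with only two ways per set, a third such address evicts one of the first two.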

Cache Interference: Example
Figure label: 4096 elements

Solution: Pad Matrix Rows
Figure labels: 4096 elements, 8 element pad
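The effect of the 8-element pad can be estimated by counting how many distinct cache sets a walk down one column touches. The sketch below assumes a row-major matrix of doubles whose storage starts at a cache-line boundary, with the 64-byte line and 512-set geometry from the earlier slide; `sets_touched_by_column` is a hypothetical helper, not code from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct cache sets touched when walking down one column of a
// row-major matrix of doubles. ld is the leading dimension (row stride in
// elements), i.e. the row length plus any padding.
std::size_t sets_touched_by_column(std::size_t rows, std::size_t ld) {
    constexpr std::size_t kLineBytes = 64;  // cache line size from the slides
    constexpr std::size_t kSets = 512;      // 64 KB / (64 B * 2 ways)
    assert(ld > 0);
    std::set<std::size_t> sets;
    for (std::size_t r = 0; r < rows; ++r) {
        // Byte offset of element (r, 0) from the start of the matrix.
        const std::size_t byte_offset = r * ld * sizeof(double);
        sets.insert((byte_offset / kLineBytes) % kSets);
    }
    return sets.size();
}
```

With `ld = 4096` every row of the column lands in the same set, while `ld = 4096 + 8` shifts each row by one cache line so consecutive rows spread across all 512 sets.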

Cache Interference (Graphs)
Figure panels: Before, After

Outline: Overview; Results (Summary, Architectural heterogeneity, Cache effects, Parallelism [current section], Scheduling, Code size); Conclusion

Parallelism
Toledo shows higher parallelism, particularly in burdened parallelism and large matrices.
Still doesn't explain poor scaling of right at low numbers of cores.

               Toledo                       Right-looking
Matrix Size    Parallelism   Burdened       Parallelism   Burdened
                             Parallelism                  Parallelism
2048x2048      15.8          15.5           16.0          12.2
4096x4096      38.1          37.4           34.6          26.0
8192x8192      92.6          91.1           72.8          57.3
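One way to read the table: parallelism (work divided by span) caps the achievable speedup, as does the core count. The helper below is a hypothetical illustration of that bound, not a tool used in the talk.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on speedup for a computation with the given parallelism
// (work divided by span) when run on the given number of cores.
double speedup_bound(double parallelism, int cores) {
    assert(parallelism > 0.0 && cores > 0);
    return std::min(parallelism, static_cast<double>(cores));
}
```

For the right-looking version on a 2048x2048 matrix, the burdened parallelism of 12.2 still exceeds 8 cores, so the bound there is set by the core count. That is consistent with the slide's remark that parallelism alone does not explain the poor scaling at low core counts.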

System Factors (Load Latency)
Figure: Performance of Right relative to Toledo

System Factors (Load Latency)
Figure: Performance of Tile relative to Toledo

Outline: Overview; Results (Summary, Architectural heterogeneity, Cache effects, Parallelism, Scheduling [current section], Code size); Conclusion

Scheduling
Cilk++ provides a dynamic scheduler.
PLASMA and pthread use a static schedule.
Compare performance under a multiprogrammed workload.

Scheduling Graph
Cilk++ implementations degrade more gracefully.
PLASMA does OK; pthread right ("tile") doesn't.
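A toy model can suggest why a dynamic schedule tends to degrade more gracefully when another program takes part of a core. The sketch below is a simplified simulation with unit-cost tasks, not the Cilk++ work-stealing scheduler or the pthread code from the talk; the function names and the `speeds` parameter (the fraction of each core available to this program) are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Finish time when unit-cost tasks are divided among cores up front.
// speeds[i] is the fraction of core i available to this program.
double static_makespan(std::size_t tasks, const std::vector<double>& speeds) {
    assert(!speeds.empty());
    const std::size_t cores = speeds.size();
    double worst = 0.0;
    for (std::size_t i = 0; i < cores; ++i) {
        // Even split; the first (tasks % cores) cores take one extra task.
        const std::size_t mine = tasks / cores + (i < tasks % cores ? 1 : 0);
        worst = std::max(worst, static_cast<double>(mine) / speeds[i]);
    }
    return worst;
}

// Finish time under a greedy dynamic model: each task goes to whichever
// core would complete it earliest given the work already assigned.
double dynamic_makespan(std::size_t tasks, const std::vector<double>& speeds) {
    assert(!speeds.empty());
    std::vector<double> busy_until(speeds.size(), 0.0);
    for (std::size_t t = 0; t < tasks; ++t) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < speeds.size(); ++i) {
            if (busy_until[i] + 1.0 / speeds[i] <
                busy_until[best] + 1.0 / speeds[best]) {
                best = i;
            }
        }
        busy_until[best] += 1.0 / speeds[best];
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

With four cores where one runs at half speed, the static split waits for the slow core to finish its full share, while the greedy model shifts tasks toward the faster cores and finishes much sooner. When all cores run at full speed the two models give the same finish time.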

Outline: Overview; Results (Summary, Architectural heterogeneity, Cache effects, Parallelism, Scheduling, Code size [current section]); Conclusion

Code Style
Comparing different languages.
Expected a large difference, but they are similar.
Complexity is in the base case.
Base cases are shared.

Lines of Code

              Toledo   Right-looking   PLASMA   Pthread Right
Just LU       111      121             143      134
Everything    238      257             269      934*

* Includes base case wrappers

Conclusion
Cilk++ can perform competitively with optimized math libraries.
Cache behavior is the most important factor.
Cilk++ degrades more gracefully with other things running, especially compared to hand-coded pthread versions.
Code size is not a major factor.
