OPTIMIZING LU FACTORIZATION IN CILK++. Nathan Beckmann, Silas Boyd-Wickizer.

1 OPTIMIZING LU FACTORIZATION IN CILK++. Nathan Beckmann, Silas Boyd-Wickizer

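2 THE PROBLEM: LU is a common matrix operation with a broad range of applications. It writes a matrix as a product of L and U. Example: PA = LU

\[
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
=
\begin{pmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{pmatrix}
\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}
\]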
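To make the factorization concrete, here is a minimal unblocked sketch of LU with partial pivoting in plain C++. It is not the code from the talk: the `Matrix` type and `lu_factor` function are hypothetical names introduced for illustration, the storage is assumed to be row-major, and L is assumed to have an implicit unit diagonal (the slide shows a general diagonal).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Dense row-major n x n matrix in a flat vector (illustrative type).
struct Matrix {
    std::size_t n;
    std::vector<double> a;
    double& at(std::size_t i, std::size_t j) { return a[i * n + j]; }
    double at(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

// Factors m in place so that P*A = L*U.
// On return, U sits on and above the diagonal and the strictly lower part
// holds L (unit diagonal implied). perm[i] is the index of the original row
// that ended up in row i. Returns false if an exactly zero pivot is found.
bool lu_factor(Matrix& m, std::vector<std::size_t>& perm) {
    const std::size_t n = m.n;
    assert(m.a.size() == n * n);
    perm.resize(n);
    for (std::size_t i = 0; i < n; ++i) perm[i] = i;

    for (std::size_t k = 0; k < n; ++k) {
        // Partial pivoting: largest magnitude in column k at or below row k.
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(m.at(i, k)) > std::fabs(m.at(p, k))) p = i;
        if (m.at(p, k) == 0.0) return false;
        if (p != k) {
            for (std::size_t j = 0; j < n; ++j) std::swap(m.at(k, j), m.at(p, j));
            std::swap(perm[k], perm[p]);
        }
        // Right-looking step: scale column k, then update the trailing
        // submatrix. The rows of this update are independent of each other.
        for (std::size_t i = k + 1; i < n; ++i) {
            m.at(i, k) /= m.at(k, k);
            for (std::size_t j = k + 1; j < n; ++j)
                m.at(i, j) -= m.at(i, k) * m.at(k, j);
        }
    }
    return true;
}
```

Calling `lu_factor(m, perm)` on a copy of A leaves L and U packed in `m`; row i of PA is row `perm[i]` of the original A. The trailing-submatrix update is where a parallel implementation would find most of its work.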

3 THE PROBLEM

6 [Figure labels: Small parallelism; Big parallelism]

7 OUTLINE: Overview (current); Results; Conclusion

8 OVERVIEW: Four implementations of LU: PLASMA (highly optimized third party library); Sivan Toledo's algorithm in Cilk++ (courtesy of Bradley); parallel standard "right-looking" in Cilk++; right-looking in pthreads. All implementations use the same base case: GotoBLAS2 matrix routines. Analyze performance: machine architecture, cache behavior.

9 OUTLINE: Overview; Results: Summary (current), Architectural heterogeneity, Cache effects, Parallelism, Scheduling, Code size; Conclusion

10 METHODOLOGY: Machine configurations: AMD16: Quad-quad AMD Opteron 8350 @ 2.0 GHz; Intel16: Quad-quad Intel Xeon E7340 @ 2.0 GHz; Intel8: Dual-quad Intel Xeon E5530 @ 2.4 GHz. "Xen" indicates running a Xen-enabled kernel; all tests still ran in dom0 (outside a virtual machine).

11 PERFORMANCE SUMMARY: Quite significant performance heterogeneity by machine architecture. Large impact from caches.

LU performance (gflops on 4k x 4k, 8 cores)

            AMD16    Intel16    Intel16Xen    Intel8Xen
PLASMA      28.7     21.5       20.6          31.1
Toledo      17.2     19.6       17.4          32.5
Right       7.72     8.53       7.38          23.2
Pthread     12.5     11.2       10.8          22.1
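The slides do not say how the gflops figures were computed. A minimal sketch, assuming the conventional operation count of about (2/3)n^3 floating-point operations for LU of an n x n matrix, relates a reported rate to wall-clock time. The function names are hypothetical.

```cpp
#include <cassert>

// Approximate floating-point operation count for LU of an n x n matrix,
// using the conventional leading term (2/3) * n^3 (an assumption here).
double lu_flops(double n) { return 2.0 * n * n * n / 3.0; }

// Rate in gflops for a factorization that took `seconds` of wall-clock time.
double lu_gflops(double n, double seconds) {
    assert(seconds > 0.0);
    return lu_flops(n) / seconds / 1e9;
}

// Inverse: wall-clock time implied by a reported gflops rate.
double lu_seconds(double n, double gflops) {
    assert(gflops > 0.0);
    return lu_flops(n) / (gflops * 1e9);
}
```

Under that assumption, a 4096 x 4096 factorization is about 4.6e10 operations, so a rate of 28.7 gflops would correspond to roughly 1.6 seconds.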

12 LU SCALING

13 OUTLINE: Overview; Results: Summary, Architectural heterogeneity (current), Cache effects, Parallelism, Scheduling, Code size; Conclusion

14 ARCHITECTURAL VARIATION (BY ARCH.) [Figure panels: AMD16, Intel16, Intel8Xen]

15 ARCHITECTURAL VARIATION (BY ALGORITHM)

16 XEN INTERFERENCE: Strange behavior with increasing core count on Intel16. [Figure panels: Intel16Xen, Intel16]

17 OUTLINE: Overview; Results: Summary, Architectural heterogeneity, Cache effects (current), Parallelism, Scheduling, Code size; Conclusion

18 CACHE INTERFERENCE: Noticed a scaling problem with the Toledo algorithm. Tests used matrices of size 2^n, which caused conflict misses in the processor cache.

19 CACHE INTERFERENCE: EXAMPLE. The AMD Opteron has 64-byte cache lines and a 64 KB 2-way set-associative cache: 512 sets, 2 cache lines each. Addresses 32 KB apart (4096 doubles) map to the same set. Address bits: offset = bits 0-5, set = bits 6-14, tag = bits 15-63.
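The arithmetic on this slide can be checked with a short sketch. The cache geometry comes from the slide; the function names, the row-major layout, and the matrix base address of 0 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Cache geometry from the slide: 64-byte lines, 64 KB, 2-way set associative.
constexpr std::uint64_t kLineBytes = 64;
constexpr std::uint64_t kCacheBytes = 64 * 1024;
constexpr std::uint64_t kWays = 2;
constexpr std::uint64_t kSets = kCacheBytes / (kLineBytes * kWays);  // 512 sets

// Set index of a byte address: drop the 6 offset bits, keep the next 9 bits.
constexpr std::uint64_t set_index(std::uint64_t addr) {
    return (addr / kLineBytes) % kSets;
}

// Set index of element (row, col) of a row-major matrix of doubles whose
// rows are `ld` elements apart. Assumes the matrix starts at address 0.
constexpr std::uint64_t element_set(std::uint64_t row, std::uint64_t col,
                                    std::uint64_t ld) {
    return set_index((row * ld + col) * sizeof(double));
}

// Number of bytes after which set indices repeat: 512 sets * 64 bytes = 32 KB.
constexpr std::uint64_t kAliasStrideBytes = kSets * kLineBytes;
```

With a row stride of 4096 doubles, `element_set(r, 0, 4096)` is the same for every row `r`, so walking down a column competes for just two cache lines.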

20 CACHE INTERFERENCE: EXAMPLE [Figure: matrix rows of 4096 elements]

27 SOLUTION: PAD MATRIX ROWS [Figure: rows of 4096 elements, each followed by an 8-element pad]
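A small sketch shows why an 8-element pad helps, assuming row-major storage of doubles, a matrix base address of 0, and the 64-byte line / 512-set geometry from the earlier example. The function name is hypothetical. Eight doubles are exactly one cache line, so each padded row starts one set later than the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct cache sets touched when reading column 0 of the first
// `rows` rows of a row-major matrix of doubles with row stride `ld` elements.
// Assumes 64-byte lines, 512 sets, and a matrix base address of 0.
std::size_t distinct_sets_in_column(std::size_t rows, std::size_t ld) {
    assert(ld > 0);
    const std::size_t line_bytes = 64;
    const std::size_t sets = 512;
    std::set<std::size_t> seen;
    for (std::size_t r = 0; r < rows; ++r) {
        const std::size_t addr = r * ld * sizeof(double);
        seen.insert((addr / line_bytes) % sets);
    }
    return seen.size();
}
```

Under these assumptions, `distinct_sets_in_column(512, 4096)` touches a single set, while `distinct_sets_in_column(512, 4096 + 8)` spreads the same accesses over all 512 sets.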

31 CACHE INTERFERENCE (GRAPHS) [Figure panels: Before; After]

32 OUTLINE: Overview; Results: Summary, Architectural heterogeneity, Cache effects, Parallelism (current), Scheduling, Code size; Conclusion

33 PARALLELISM: Toledo shows higher parallelism, particularly in burdened parallelism and on large matrices. This still doesn't explain the poor scaling of right-looking at low numbers of cores.

               Toledo                       Right-looking
Matrix size    Parallelism    Burdened      Parallelism    Burdened
2048x2048      15.8           15.5          16.0           12.2
4096x4096      38.1           37.4          34.6           26.0
8192x8192      92.6           91.1          72.8           57.3
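As a reminder of how to read these numbers: parallelism is work divided by span, and it caps the achievable speedup regardless of core count. The sketch below uses hypothetical function names and the standard bound speedup <= min(P, parallelism); it does not model scheduling overhead beyond what the burdened figure already includes.

```cpp
#include <algorithm>
#include <cassert>

// Parallelism = work / span (T_1 / T_infinity), both in the same time unit.
double parallelism(double work, double span) {
    assert(span > 0.0);
    return work / span;
}

// Upper bound on speedup with `cores` processors: min(cores, parallelism).
double speedup_bound(double cores, double par) {
    return std::min(cores, par);
}
```

For example, with the right-looking burdened parallelism of 26.0 at 4096x4096, `speedup_bound(8, 26.0)` is 8, so the measured parallelism does not limit an 8-core run. That is consistent with the slide's remark that parallelism alone does not explain the poor scaling at low core counts.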

34 SYSTEM FACTORS (LOAD LATENCY): Performance of Right relative to Toledo

35 SYSTEM FACTORS (LOAD LATENCY): Performance of Tile relative to Toledo

36 OUTLINE: Overview; Results: Summary, Architectural heterogeneity, Cache effects, Parallelism, Scheduling (current), Code size; Conclusion

37 SCHEDULING: Cilk++ provides a dynamic scheduler. PLASMA and pthread use a static schedule. Compare performance under a multiprogrammed workload.
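The difference between the two scheduling styles under a multiprogrammed load can be illustrated with a toy model. This is not the Cilk++ work-stealing scheduler or the PLASMA schedule: it assumes unit-cost independent tasks and models interference from other programs as a per-worker speed factor. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Static schedule: tasks are divided evenly up front, so the run finishes
// when the slowest worker finishes its fixed share.
double static_makespan(std::size_t tasks, const std::vector<double>& speed) {
    assert(!speed.empty());
    const std::size_t p = speed.size();
    double worst = 0.0;
    for (std::size_t w = 0; w < p; ++w) {
        // Worker w gets tasks / p tasks, plus one if a remainder is left.
        const std::size_t share = tasks / p + (w < tasks % p ? 1 : 0);
        worst = std::max(worst, static_cast<double>(share) / speed[w]);
    }
    return worst;
}

// Dynamic schedule (toy model): whichever worker becomes free first takes
// the next task, so slowed-down workers simply end up doing fewer tasks.
double dynamic_makespan(std::size_t tasks, const std::vector<double>& speed) {
    assert(!speed.empty());
    std::vector<double> free_at(speed.size(), 0.0);
    for (std::size_t t = 0; t < tasks; ++t) {
        auto it = std::min_element(free_at.begin(), free_at.end());
        const std::size_t w = static_cast<std::size_t>(it - free_at.begin());
        *it += 1.0 / speed[w];
    }
    return *std::max_element(free_at.begin(), free_at.end());
}
```

With four workers, one of which runs at half speed because another job shares its core, 120 unit tasks take 60 time units under the static split but well under 40 in the dynamic model.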

38 SCHEDULING GRAPH: Cilk++ implementations degrade more gracefully. PLASMA does OK; pthread right ("tile") doesn't.

39 OUTLINE: Overview; Results: Summary, Architectural heterogeneity, Cache effects, Parallelism, Scheduling, Code size (current); Conclusion

40 CODE STYLE: Comparing different languages. Expected a large difference, but they are similar. Complexity is in the base case, and base cases are shared.

Lines of code    Toledo    Right-looking    PLASMA    Pthread Right
Just LU          111       121              143       134
Everything       238       257              269       934*

* Includes base case wrappers

41 CONCLUSION: Cilk++ can perform competitively with optimized math libraries. Cache behavior is the most important factor. Cilk++ degrades more gracefully when other things are running, especially compared to hand-coded pthread versions. Code size is not a major factor.
