Presentation is loading. Please wait.

Presentation is loading. Please wait.

CBELU - 1 James 11/23/2015 MIT Lincoln Laboratory High Performance Simulations of Electrochemical Models on the Cell Broadband Engine James Geraci HPEC.

Similar presentations


Presentation on theme: "CBELU - 1 James 11/23/2015 MIT Lincoln Laboratory High Performance Simulations of Electrochemical Models on the Cell Broadband Engine James Geraci HPEC."— Presentation transcript:

1 CBELU - 1 James 11/23/2015 MIT Lincoln Laboratory High Performance Simulations of Electrochemical Models on the Cell Broadband Engine James Geraci HPEC Workshop September 20, 2007 This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government

2 MIT Lincoln Laboratory CBELU - 2 James Geraci 11/23/2015 Introduction Modeling battery physics LU decomposition CELL Broadband Engine Outline Introduction Out of Core Algorithm In Core Algorithm Summary

3 MIT Lincoln Laboratory CBELU - 3 James Geraci 11/23/2015 Introduction Real Time Battery State of Health Estimation The way in which a battery ages is a strong function of battery geometry. The rate of self discharge is also a function of battery geometry. Lead Acid Battery

4 MIT Lincoln Laboratory CBELU - 4 James Geraci 11/23/2015 Inside a Lead Acid Battery Face View Lead Dioxide Plate Face View Lead Plate Lead acid batteries are made up of stacks of lead dixode and lead electrode pairs

5 MIT Lincoln Laboratory CBELU - 5 James Geraci 11/23/2015 Finite volume description of battery Lead Dioxide Sulfuric Acid Lead Edge view lead dioxide plate The center volume’s physics can be influenced by the physics of the volumes to the right, the left, above, and below The 5 point stencil gives rise to banded matrix

6 MIT Lincoln Laboratory CBELU - 6 James Geraci 11/23/2015 Matlab spy plot of Banded Matrix Non-zero entries shown in blue Inner band around diagonal Distant narrow outer band For the physics of a lead acid battery, the matrix is not symmetric (shown) For the physics of a lead acid battery, the matrix is poorly conditioned 10^8-10^12 - need double precision, no GPGPU

7 MIT Lincoln Laboratory CBELU - 7 James Geraci 11/23/2015 Introduction Modeling battery physics LU decomposition CELL Broadband Engine Outline Introduction Out of Core Algorithm In Core Algorithm Summary

8 MIT Lincoln Laboratory CBELU - 8 James Geraci 11/23/2015 LU Decomposition LU decomposition decomposes a matrix J into a lower triangular matrix L and an upper triangular matrix U L & U can be used to solve a system of linear equations Jx = r by forward elimination back substitution –Essentially Gaussian Elimination Often used on poorly conditioned systems where ‘iterative solvers’ can’t be used. Difficult to parallelize for small systems because of the fine grain nature of the parallelism involved. Banded LU is a special case of LU –The matrix J has a special ‘banded’ data pattern.

9 MIT Lincoln Laboratory CBELU - 9 James Geraci 11/23/2015 Introduction Modeling battery physics LU decomposition CELL Broadband Engine Outline Introduction Out of Core Algorithm In Core Algorithm Summary

10 MIT Lincoln Laboratory CBELU - 10 James Geraci 11/23/2015 Cell Broadband Engine Cell Broadband Engine is a new heterogeneous multicore processor that features large internal and off chip bandwidth.

11 MIT Lincoln Laboratory CBELU - 11 James Geraci 11/23/2015 EIB (96B/cycle 3.2Gcycles/sec*96B/cycle = 286.1GB/sec) Cell Broadband Engine SPE MFC LS SXU SPU SPE MFC LS SXU SPU SPE MFC LS SXU SPU SPE MFC LS SXU SPU SPE MFC LS SXU SPU SPE MFC LS SXU SPU SPE MFC LS SXU SPU SPE MFC LS SXU SPU PPU L1 L2PXU MIC 16B/cycle 47.7GB/sec 16B/cycle PPE 25GB/sec Bandwidth: 50GB/sec/SPU 25GB/sec off chip SPE Performance: 14 Gflops double

12 MIT Lincoln Laboratory CBELU - 12 James Geraci 11/23/2015 Outline Introduction Out of Core Algorithm In Core Algorithm Summary Banded LU Performance Latency Synchronization

13 MIT Lincoln Laboratory CBELU - 13 James Geraci 11/23/2015 Banded LU Out of Core Algorithm Explained i i Matrix

14 MIT Lincoln Laboratory CBELU - 14 James Geraci 11/23/2015 Banded LU Out of Core Algorithm Analyzed i i Multiplies: Compute Memory Accesses Subtracts: Gets: Puts: Memory Doubles Moved SynchronizationsN Compute/Memory Ratio = ½ For a 16728x16728 matrix with band size of 420, almost 22 GB of data would have to be moved. Matrix

15 MIT Lincoln Laboratory CBELU - 15 James Geraci 11/23/2015 Outline Introduction Out of Core Algorithm In Core Algorithm Summary Banded LU Performance Latency Synchronization

16 MIT Lincoln Laboratory CBELU - 16 James Geraci 11/23/2015 Peformance of Out of Core Algorithm Out of Core Algorithm outperforms UMFpack on Opteron 246 based workstation. No appreciable gain in performance past 4 SPEs Gflops for matrix size 16728x16728 w/ band size of 420Compute Time for matrix size 16728x16728 w/ band size of 420 Seconds 2 1 3 14562 3 Gflops Out of core Linear speed up Out of core Opteron 246: UMFpack Linear speed up 3 4 5612 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

17 MIT Lincoln Laboratory CBELU - 17 James Geraci 11/23/2015 Outline Introduction Out of Core Algorithm In Core Algorithm Summary Banded LU Performance Latency Synchronization

18 MIT Lincoln Laboratory CBELU - 18 James Geraci 11/23/2015 Latency for DMA put 16 bytes 1024 bytes Significant Performance hit for memory access smaller than 16bytes Bandwidth limited region starts at 8x128bytes

19 MIT Lincoln Laboratory CBELU - 19 James Geraci 11/23/2015 SPE to main memory bandwidth 16 bytes 1024 bytes Theoretical Maximum Bandwidth 25GB/sec Theoretical maximum bandwidth can almost be achieved for larger message sizes

20 MIT Lincoln Laboratory CBELU - 20 James Geraci 11/23/2015 OCA Memory Access Size Dependence Out of Core performance is better when memory access is a byte multiple of 128

21 MIT Lincoln Laboratory CBELU - 21 James Geraci 11/23/2015 PPE/SPE Synchronization Mailboxes // barrier While(sum != NUM_SPE){ for(i < NUM_SPE){ if(NewMailBoxMessage){ readMailBox;}}} // Notify PPE spu_writech(SPU_WrOutMbox,1); // Wait for PPE spu_readch(SPU_RdInMbox); // Restart SPEs writeSPEinMboxes(); PPE SPE MFC LS SXU SPU PPU L1 L2PXU 16B/cycle Out Mailbox In Mailbox PPE EIB Mailboxes are one common method of synchronization

22 MIT Lincoln Laboratory CBELU - 22 James Geraci 11/23/2015 PPE/SPE Synchronization Mailboxes Round Trip Times Mailboxes using IBM SDK 2.1 C- intrinsics Mailboxes using IBM SDK 2.1 C- intrinsics & pointers to MMIO registers Mailboxes by pointers alone Standard round trip latency 16byte message IBM SDK 2.1 C-intrinsics for mailboxes do not seem to have idea performance. 6.24  seconds 3.65  seconds 0.35  seconds / not reliable 58.33  seconds TCP (Gigabit) 8.08  seconds Infiniband

23 MIT Lincoln Laboratory CBELU - 23 James Geraci 11/23/2015 Synchronization by hybrid of C-intrinsics and pointers to MMIO registers Synchronization with IBM SDK 2.1 C-intrinsics & pointers to MMIO registers yields fairly good performance for a moderate numbers SPEs Matrix size 16728x16728 w/ band size of 420 123 456 Number of SPEs Out of core Linear speed up 1 2 Seconds 3

24 MIT Lincoln Laboratory CBELU - 24 James Geraci 11/23/2015 Synchronization exclusively by IBM SDK 2.1 mailbox C intrinsics Synchronization by IBM SDK 2.1 mailbox C intrinsics alone, yields little gain for low SPE count and performance LOSS after only 4 SPEs!!! Seconds 23 456 Number of SPEs SPE synchronization by SDK C intrinsics 0.1 1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

25 MIT Lincoln Laboratory CBELU - 25 James Geraci 11/23/2015 Data from synchronization exclusively by pointers Seconds 23 456 Number of SPEs SPE synchronization by SDK C intrinsics 0.1 1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Synchronization by pointers alone, yields a nice speed up for all SPEs Seems to be reliability issue with reading mailbox status register via pointers

26 MIT Lincoln Laboratory CBELU - 26 James Geraci 11/23/2015 Outline Introduction Out of Core Algorithm In Core Algorithm Summary Hide memory accesses Hide synchronizations

27 MIT Lincoln Laboratory CBELU - 27 James Geraci 11/23/2015 Banded LU In Core Algorithm Multiplies: Compute Total Accesses: Subtracts: Gets: Puts: Memory AccessesDoubles Moved Synchronizations N Compute/Memory Ratio = For a 16728x16728 matrix with band size of 420, only 0.158 GB of data would have to be moved. Matrix Start up Access:

28 MIT Lincoln Laboratory CBELU - 28 James Geraci 11/23/2015 In Core Performance Achieves over 11% of theoretical performance MIT Lincoln Laboratory Max out core 0.583 Gflops Max in core 1.23 Gflops Intel Xeon 5160 3.0Ghz 0.377 Gflops

29 MIT Lincoln Laboratory CBELU - 29 James Geraci 11/23/2015 In Core Performance w/ linear speed up MIT Lincoln Laboratory Performance improves as band size increases

30 MIT Lincoln Laboratory CBELU - 30 James Geraci 11/23/2015 IBM QS20 performance MIT Lincoln Laboratory IBM QS20 Max in core 3.15 Gflops Achieves over 11% of theoretical performance

31 MIT Lincoln Laboratory CBELU - 31 James Geraci 11/23/2015 ICA Memory Access Size Dependence In core performance does not show a large dependence on memory access size

32 MIT Lincoln Laboratory CBELU - 32 James Geraci 11/23/2015 Outline Introduction Out of Core Algorithm In Core Algorithm Summary

33 MIT Lincoln Laboratory CBELU - 33 James Geraci 11/23/2015 Summary / Future Work Parallel LU decomposition can benefit from the high bandwidth of the CBE –Benefit depends greatly on synchronization scheme inCore LU offers performance advantages over out of core LU –Limits on size of matrix bandwidth for inCore LU. Partial pivoting

34 MIT Lincoln Laboratory CBELU - 34 James Geraci 11/23/2015 Acknowledgements Out of Core Algorithm –Sudarshan Raghunathan –John Chu MIT In Core Algorithm –Jeremy Kepner MIT LL –Sharon Sacco MIT LL –Rodric Rabbah IBM


Download ppt "CBELU - 1 James 11/23/2015 MIT Lincoln Laboratory High Performance Simulations of Electrochemical Models on the Cell Broadband Engine James Geraci HPEC."

Similar presentations


Ads by Google