High Performance Simulations of Electrochemical Models on the Cell Broadband Engine. James Geraci, MIT Lincoln Laboratory, HPEC Workshop 2007.

Presentation transcript:

MIT Lincoln Laboratory. High Performance Simulations of Electrochemical Models on the Cell Broadband Engine. James Geraci, HPEC Workshop, September 20, 2007. This work is sponsored by the Department of the Air Force under Air Force contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Outline: Introduction (modeling battery physics, LU decomposition, Cell Broadband Engine), Out of Core Algorithm, In Core Algorithm, Summary.

Introduction: Real-Time Battery State-of-Health Estimation. The way in which a battery ages is a strong function of battery geometry. The rate of self-discharge is also a function of battery geometry. (Figure: lead acid battery.)

Inside a Lead Acid Battery. (Figures: face view of a lead dioxide plate; face view of a lead plate.) Lead acid batteries are made up of stacks of lead dioxide and lead electrode pairs.

Finite volume description of the battery. (Figure: edge view of a lead dioxide plate, with lead dioxide, sulfuric acid, and lead regions.) The center volume's physics can be influenced by the physics of the volumes to its right, to its left, above, and below. This 5-point stencil gives rise to a banded matrix.
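The stencil-to-band mapping can be sketched in plain C. This is an illustration only, with generic Laplacian-style coefficients; the battery model's actual coefficients are not symmetric and are not reproduced here, and the function name is ours:

```c
/* Sketch: assemble the sparsity pattern of a 5-point finite-volume stencil
 * on an nx-by-ny grid into a dense n-by-n row-major matrix A (which the
 * caller must zero-initialize).  Cell k couples to offsets +/-1 (the inner
 * band around the diagonal) and +/-nx (the distant, narrow outer band). */
void assemble_5pt(double *A, int nx, int ny) {
    int n = nx * ny;
    for (int k = 0; k < n; ++k) {
        int i = k % nx, j = k / nx;
        A[k * n + k] = 4.0;                         /* diagonal        */
        if (i > 0)      A[k * n + (k - 1)]  = -1.0; /* west neighbour  */
        if (i < nx - 1) A[k * n + (k + 1)]  = -1.0; /* east neighbour  */
        if (j > 0)      A[k * n + (k - nx)] = -1.0; /* south neighbour */
        if (j < ny - 1) A[k * n + (k + nx)] = -1.0; /* north neighbour */
    }
}
```

A spy plot of the result reproduces the two-band pattern described on the next slide; the total bandwidth of the matrix is 2*nx + 1.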

Matlab spy plot of the banded matrix (non-zero entries shown in blue): an inner band around the diagonal and a distant, narrow outer band. For the physics of a lead acid battery, the matrix is not symmetric (as shown) and is poorly conditioned, with condition numbers of 10^8-10^12, so double precision is needed and GPGPU was not an option.
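A back-of-the-envelope check of the precision claim, using the standard rule of thumb that the relative error of a linear solve is on the order of the condition number times machine epsilon (the helper name is ours):

```c
#include <float.h>

/* Rough forward-error bound for solving J x = r: relative error is on the
 * order of kappa(J) * machine_eps.  With kappa in the 1e8..1e12 range
 * quoted on the slide, single precision (eps ~ 1.2e-7) gives a bound
 * above 1 -- no guaranteed digits -- while double precision
 * (eps ~ 2.2e-16) still leaves several correct digits. */
double error_bound(double kappa, double eps) {
    return kappa * eps;
}
```

For example, error_bound(1e8, FLT_EPSILON) is above 1, while error_bound(1e8, DBL_EPSILON) is around 1e-8, which is the whole argument for double-precision hardware here.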


LU Decomposition. LU decomposition factors a matrix J into a lower triangular matrix L and an upper triangular matrix U. L and U can be used to solve a system of linear equations Jx = r by forward elimination followed by back substitution; this is essentially Gaussian elimination. It is often used on poorly conditioned systems where iterative solvers cannot be used. It is difficult to parallelize for small systems because of the fine-grained nature of the parallelism involved. Banded LU is a special case of LU in which the matrix J has a special 'banded' data pattern.
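A minimal dense sketch of the factor-and-solve sequence described above. This is Doolittle LU without pivoting, for illustration only, not the banded Cell implementation; the function names are ours:

```c
/* Doolittle LU factorization without pivoting (a sketch, not the banded
 * Cell kernel).  A is n-by-n row-major and is overwritten in place: U on
 * and above the diagonal, the multipliers of L (unit diagonal implied)
 * below it. */
void lu_factor(double *A, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            double m = A[i * n + k] / A[k * n + k]; /* multiplier L(i,k) */
            A[i * n + k] = m;
            for (int j = k + 1; j < n; ++j)
                A[i * n + j] -= m * A[k * n + j];   /* eliminate row i   */
        }
}

/* Solve J x = r using the stored factors: forward elimination with L,
 * then back substitution with U; x overwrites r. */
void lu_solve(const double *A, int n, double *r) {
    for (int i = 1; i < n; ++i)                     /* L y = r */
        for (int j = 0; j < i; ++j)
            r[i] -= A[i * n + j] * r[j];
    for (int i = n - 1; i >= 0; --i) {              /* U x = y */
        for (int j = i + 1; j < n; ++j)
            r[i] -= A[i * n + j] * r[j];
        r[i] /= A[i * n + i];
    }
}
```

In the banded case the inner loops only run over the band, which is what makes the per-step work fine-grained and hard to parallelize.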


Cell Broadband Engine. The Cell Broadband Engine is a new heterogeneous multicore processor that features large internal and off-chip bandwidth.

Cell Broadband Engine block diagram: eight SPEs (each an SPU with SXU, local store (LS), and MFC) attached to the Element Interconnect Bus (EIB), plus the PPE (PPU with L1/L2 and PXU) and the memory interface controller (MIC). EIB: 96 B/cycle (3.2 Gcycles/sec x 96 B/cycle = 286.1 GB/sec). PPE connection: 16 B/cycle (47.7 GB/sec); MIC: 16 B/cycle, 25 GB/sec. Bandwidth: 50 GB/sec per SPU, 25 GB/sec off chip. SPE performance: 14 Gflops double precision.

Outline: Out of Core Algorithm (Banded LU, Performance, Latency, Synchronization).

Banded LU Out of Core Algorithm Explained. (Figure: the banded matrix, showing the step that eliminates row and column i.)

Banded LU Out of Core Algorithm Analyzed. (Table: compute counts of multiplies and subtracts vs. memory-access counts of gets, puts, and doubles moved per elimination step.) Synchronizations: N. Compute/Memory Ratio = 1/2. For a 16728x16728 matrix with band size of 420, almost 22 GB of data would have to be moved.


Performance of Out of Core Algorithm. The Out of Core Algorithm outperforms UMFpack on an Opteron 246 based workstation. There is no appreciable gain in performance past 4 SPEs. (Plots: Gflops and compute time in seconds vs. number of SPEs for a 16728x16728 matrix with band size of 420; curves for out of core, linear speed-up, and Opteron 246 UMFpack.)


Latency for DMA put. (Plot: latency vs. message size, 16 bytes to 1024 bytes.) There is a significant performance hit for memory accesses smaller than 16 bytes. The bandwidth-limited region starts at 8x128 bytes.

SPE to main memory bandwidth. (Plot: bandwidth vs. message size, 16 bytes to 1024 bytes; theoretical maximum bandwidth 25 GB/sec.) The theoretical maximum bandwidth can almost be achieved for larger message sizes.

OCA Memory Access Size Dependence. Out of Core performance is better when each memory access is a multiple of 128 bytes.
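One way to exploit this observation is to pad every transfer size up to the next 128-byte boundary. A hedged helper (the name is ours, not from the original code):

```c
/* Round a DMA transfer size up to the next multiple of 128 bytes, the
 * granularity at which the measurements above reach full bandwidth.
 * The add-and-mask trick works because 128 is a power of two. */
static inline unsigned round_up_128(unsigned bytes) {
    return (bytes + 127u) & ~127u;
}
```

The cost is transferring a few extra bytes per message; the benefit, per the slide, is staying in the regime where accesses run at full bandwidth.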

PPE/SPE Synchronization: Mailboxes. Mailboxes are one common method of synchronization. (Figure: PPE and SPE connected over the EIB, with In and Out mailbox registers in the SPE.) PPE side, a barrier that polls each SPE's outbound mailbox until all have reported, then restarts them:

    sum = 0;
    while (sum != NUM_SPE)
        for (i = 0; i < NUM_SPE; i++)
            if (newMailboxMessage(i)) { readMailbox(i); sum++; }
    writeSPEinMboxes();              /* restart SPEs */

SPE side:

    spu_writech(SPU_WrOutMbox, 1);   /* notify PPE   */
    spu_readch(SPU_RdInMbox);        /* wait for PPE */
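The same barrier pattern, sketched portably with C11 atomics and pthreads rather than the Cell mailbox hardware (all names here are ours; the real code polls mailbox registers instead of shared flags):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_SPE 4

static atomic_int arrived;      /* stands in for the SPEs' outbound mailboxes */
static atomic_int release_flag; /* stands in for the PPE's restart message    */

/* Worker ("SPE") side: notify the coordinator, then spin until released,
 * mirroring spu_writech(SPU_WrOutMbox, 1) / spu_readch(SPU_RdInMbox). */
static void *spe_worker(void *arg) {
    (void)arg;
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&release_flag) == 0)
        ;                       /* busy-wait, as the SPE does on its mailbox */
    return NULL;
}

/* Coordinator ("PPE") side: poll until every worker has posted, then
 * write the restart message, mirroring the PPE mailbox loop above. */
void ppe_barrier(pthread_t t[NUM_SPE]) {
    for (int i = 0; i < NUM_SPE; ++i)
        pthread_create(&t[i], NULL, spe_worker, NULL);
    while (atomic_load(&arrived) != NUM_SPE)
        ;                       /* poll the "mailboxes" */
    atomic_store(&release_flag, 1);
    for (int i = 0; i < NUM_SPE; ++i)
        pthread_join(t[i], NULL);
}
```

The design point the slides go on to measure is exactly how this poll-and-release round trip is implemented: through SDK intrinsics, through pointers to MMIO registers, or a hybrid of the two.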

PPE/SPE Synchronization: Mailbox Round-Trip Times. Standard round-trip latency for a 16-byte message:

    Mailboxes using IBM SDK 2.1 C intrinsics:                    ... microseconds
    Mailboxes using IBM SDK 2.1 C intrinsics and MMIO pointers:  3.65 microseconds
    Mailboxes by pointers alone:                                 0.35 microseconds (not reliable)
    TCP (Gigabit):                                               ... microseconds
    Infiniband:                                                  8.08 microseconds

The IBM SDK 2.1 C intrinsics for mailboxes do not seem to have ideal performance.

Synchronization by a hybrid of C intrinsics and pointers to MMIO registers. Synchronization with IBM SDK 2.1 C intrinsics and pointers to MMIO registers yields fairly good performance for a moderate number of SPEs. (Plot: compute time in seconds vs. number of SPEs for a 16728x16728 matrix; curves for out of core and linear speed-up.)

Synchronization exclusively by IBM SDK 2.1 mailbox C intrinsics. Synchronization by the IBM SDK 2.1 mailbox C intrinsics alone yields little gain at low SPE counts, and a performance LOSS after only 4 SPEs. (Plot: compute time in seconds vs. number of SPEs.)

Data from synchronization exclusively by pointers. Synchronization by pointers alone yields a nice speed-up at all SPE counts, but there seems to be a reliability issue with reading the mailbox status register via pointers. (Plot: compute time in seconds vs. number of SPEs.)

Outline: In Core Algorithm (hide memory accesses, hide synchronizations).

Banded LU In Core Algorithm. (Table: compute counts of multiplies and subtracts vs. memory-access counts of gets, puts, start-up accesses, total accesses, and doubles moved; figure: the banded matrix.) Synchronizations: N. For a 16728x16728 matrix with band size of 420, only ... GB of data would have to be moved.

In Core Performance. Achieves over 11% of theoretical performance. (Plot: max out of core vs. max in core, 1.23 Gflops, compared against an Intel Xeon.)

In Core Performance with linear speed-up. (Plot.) Performance improves as band size increases.

IBM QS20 performance. (Plot: IBM QS20, max in core 3.15 Gflops.) Achieves over 11% of theoretical performance.

ICA Memory Access Size Dependence. In Core performance does not show a large dependence on memory access size.


Summary / Future Work. Parallel LU decomposition can benefit from the high bandwidth of the CBE; the benefit depends greatly on the synchronization scheme. In Core LU offers performance advantages over Out of Core LU, though there are limits on the matrix bandwidth it can handle. Future work: partial pivoting.

Acknowledgements. Out of Core Algorithm: Sudarshan Raghunathan (MIT), John Chu (MIT). In Core Algorithm: Jeremy Kepner (MIT LL), Sharon Sacco (MIT LL), Rodric Rabbah (IBM).