Increasing Performance of Commercial Reservoir Simulators by Core Skew Allocation
José S. A. Cavalcante Filho, Thomas D.S. Oliveira, Silvio R.R. Costa, Luis V. M. Ribas, Margareth N. Cruz, PETROBRAS S.A.
Daniel Dias, Ynigo Zamudio, Schlumberger Ltd
Myrian C.A. Costa, Albino A. Aveleda, Alvaro L.G.A. Coutinho, High Performance Computing Center, COPPE-Federal University of Rio de Janeiro
What can we do if we can’t mess with the source code?
Advances in multi-core processors: how to increase performance?
Commercial software
Legacy software
No instrumentation or code modification allowed
Issues on Multicore Performance
Multi-core regime: when the memory system is saturated, the order and pattern of data accesses become a performance-determining factor.
"We varied both the number of nodes and the number of cores per node, and found that the efficiency (performance as a function of total cores) was largely independent of the number of nodes and extremely dependent on cores per node"
"Core skew" effects: because each core might be in a slightly different phase of execution, different functions may be running on different cores at the same time.
"Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications", J. Diamond et al., 2011 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
An Example: Rayleigh-Benard 4:1:1 - 501×125×125 mesh
Elements: 39,140,625
Nodes: 7,969,752
Edges: 43,833,636
Flow equations: 31,879,008
Temperature equations: 7,642,824
Time steps: 2,954
EdgeCFD solver on Marte, a Dell Cluster running 64 cores, MPI-P2P
Elias, R. N., Camata, J. J., Aveleda, A. A., Coutinho, A. L. G. A., Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems. Lecture Notes in Computer Science, 2011. v. 6449. p. 306-313.
Dell Cluster, 64 cores, core skew allocation
[Figure: Communication graph]
[Figure: Time spent in 10 time steps]
Similar results can be found in: Jeff Diamond, Byoung-Do Kim, Martin Burtscher, Steve Keckler, Keshav Pingali and Jim Browne, Multicore Optimization for Ranger, TeraGrid09, http://www.teragrid.org/tg09/
Parallel Reservoir Simulator ECLIPSE
The ECLIPSE* family of reservoir simulation software specializes in black-oil, compositional and thermal finite-volume reservoir simulation, as well as streamline reservoir simulation.
More than 30 years in the market.
ECLIPSE Black-oil simulation: three-phase, 3D reservoir simulation supporting extensive well controls, field operations planning, and comprehensive enhanced oil recovery (EOR) schemes.
"Schlumberger – Reservoir Simulation" - http://www.slb.com/services/software/reseng.aspx
ECLIPSE Features (cont’d)
ECLIPSE Compositional simulation: describes reservoir fluid phase behavior and compositional changes associated with multi-component hydrocarbon flow.
ECLIPSE FrontSim simulation: models multiphase fluid flow along streamlines; enables better visualization of fluid flow in the reservoir.
ECLIPSE Thermal simulation: simulates a wide range of thermal recovery processes, including steam-assisted gravity drainage, toe-to-heel air injection, and cold heavy oil production with sand.
ECLIPSE
"Schlumberger – Reservoir Simulation" - http://www.slb.com/services/software/reseng.aspx
New Generation of Parallel Reservoir Simulators
INTERSECT (IX)
Multistage parallel linear solver framework
Two-stage CPR (Constraint Pressure Residual) scheme for large-scale parallel runs
Parallel Algebraic Multigrid (PAMG) solver with an F-GMRES outer iteration as the first-stage preconditioner and a parallel ILU-type scheme for the second stage
SPE 96809 - "Parallel Scalable Unstructured CPR-Type Linear Solver for Reservoir Simulation", H. Cao et al., SPE Annual Technical Conference and Exhibition, 9-12 October 2005, Dallas, Texas
INTERSECT (IX) Architectural Features
Static and dynamic load balance on unstructured and structured grids
Simulator architecture supports black-oil and compositional models within a general formulation
Distribution of data between available processors is determined by PARMETIS
Communication between processors is based on MPI (OOMPI)
SPE 93274 - "An Extensible Architecture for Next Generation Scalable Parallel Reservoir Simulation", D. Debaun et al., 19th SPE Reservoir Simulation Symposium, 2005, The Woodlands, Texas
INTERSECT (IX) Architectural layers in the simulator system SPE 93274 - “An Extensible Architecture for Next Generation Scalable Parallel Reservoir Simulation”, D. Debaun et al., 19th SPE Reservoir Simulation Symposium, 2005, The Woodlands, Texas
Multicore Processors
Nehalem Family
3-way 1333MHz QPI memory access
8MB L3 cache shared by all cores
Smart cache LLC allocation
Allocation of 2 cores ????
Harpertown Family
1333MHz FSB memory access
12MB L2 cache shared by each pair of cores
Allocation of 2 cores ???
Multicore bottlenecks: L3 cache capacity, off-chip bandwidth, DRAM banks.
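One way to read the "Allocation of 2 cores" question for Harpertown: if each pair of cores shares an L2 cache, placing two processes on cores from different pairs gives each process a cache of its own. The following is a minimal sketch of that selection rule, not the method used in the study. The function name and the core numbering (cores 2k and 2k+1 sharing a cache) are assumptions for illustration; the real numbering depends on the machine and has to be checked there.

```python
from itertools import zip_longest

def spread_over_cache_groups(cache_groups, n_procs):
    """Pick n_procs core ids, visiting cache groups round-robin so that two
    processes share a cache only after every group already hosts one."""
    picked = []
    # zip_longest yields the first core of every group, then the second, ...
    for layer in zip_longest(*cache_groups):
        for core in layer:
            if core is not None and len(picked) < n_procs:
                picked.append(core)
    if len(picked) < n_procs:
        raise ValueError("not enough cores for the requested processes")
    return picked

# Hypothetical numbering: one quad-core socket, L2 shared by pairs of cores.
socket = [[0, 1], [2, 3]]
print(spread_over_cache_groups(socket, 2))  # [0, 2]

# Hypothetical 8-core node (two such sockets) running 4 processes.
node = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(spread_over_cache_groups(node, 4))  # [0, 2, 4, 6]
```

The resulting core list could then be handed to whatever pinning mechanism the job launcher provides; which mechanism was used for the runs reported here is not stated on the slides.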
Benchmarks
B1 – Benchmark 1
Black-oil model
Significant amounts of free gas and very thin cells
1 million active cells
B2 – Benchmark 2
Compositional model
Realistic reservoir
2 million cells
Benchmark B1 Horizontal permeability distribution in B1
Benchmark B1 - Output
Benchmark B2 Horizontal permeability distribution in B2
Benchmark B2 - Output
Clusters
Marte
DELL cluster PowerEdge M1000e
16 nodes: 128 cores
Intel Xeon E5450 (Harpertown)
256 GB RAM
InfiniBand 20Gbps (DDR)
Full Clos topology
Galileu
SunBlade 6048
896 nodes: 7168 cores
Intel Xeon X5560 (Nehalem)
21 TB RAM
InfiniBand 40Gbps (QDR)
Torus-3D topology
Benchmark B1 - ECLIPSE
8 cores/node running in full-core mode needs 8 nodes
4 cores/node running in half-core mode needs 16 nodes
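The node counts above follow from keeping the number of processes fixed while changing how many cores are used on each node. A minimal sketch of that arithmetic, assuming a 64-process run (inferred from 8 × 8 = 4 × 16, not stated on the slide) and 8 physical cores per node; the function name is illustrative:

```python
import math

def nodes_needed(total_procs, procs_per_node, cores_per_node=8):
    """Return (nodes, idle_cores) for a run that places procs_per_node
    processes on each node and leaves the remaining cores unused."""
    if not 0 < procs_per_node <= cores_per_node:
        raise ValueError("procs_per_node must be between 1 and cores_per_node")
    nodes = math.ceil(total_procs / procs_per_node)
    idle_cores = nodes * cores_per_node - total_procs
    return nodes, idle_cores

# Assumed 64-process run, as implied by the node counts on the slide.
print(nodes_needed(64, 8))  # full-core mode: (8, 0)
print(nodes_needed(64, 4))  # half-core mode: (16, 64)
```

In half-core mode the run reserves as many idle cores as it uses, which is the resource cost weighed against the shorter execution time.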
Benchmark B2 - ECLIPSE
Benchmark B1 – INTERSECT (IX)
Benchmark B2 – INTERSECT (IX)
Conclusions
Core skew allocation reduces execution times in all cases
Effects are less pronounced on Nehalem
Reduction of hours or days in complex models on production runs
However, underutilization of resources is an economic issue
Need to incorporate new multi-core optimization techniques in the code
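The economic trade-off can be made concrete with a node-hour count. The sketch below assumes that whole nodes are charged whether or not all their cores are busy; the wall times in it are made up for illustration and are not results from this study.

```python
def node_hours(nodes, wall_hours):
    """Resource cost of a run when whole nodes are reserved (assumption)."""
    return nodes * wall_hours

def break_even_speedup(nodes_full, nodes_skewed):
    """Speedup the skewed run must reach to consume no more node-hours
    than the full-core run."""
    return nodes_skewed / nodes_full

# Full-core mode on 8 nodes versus half-core mode on 16 nodes.
print(break_even_speedup(8, 16))  # 2.0

# Made-up wall times, for illustration only (not measured results).
full = node_hours(8, 10.0)     # 80.0 node-hours
skewed = node_hours(16, 7.0)   # 112.0 node-hours
print(skewed > full)           # True: faster, but costs more node-hours
```

Under this charging model, half-core mode shortens time to solution but lowers total resource use only if it more than doubles the speed of the run.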