Presentation transcript:

Increasing Performance of Commercial Reservoir Simulators by Core Skew Allocation
José S. A. Cavalcante Filho, Thomas D. S. Oliveira, Silvio R. R. Costa, Luis V. M. Ribas, Margareth N. Cruz (PETROBRAS S.A.)
Daniel Dias, Ynigo Zamudio (Schlumberger Ltd.)
Myrian C. A. Costa, Albino A. Aveleda, Alvaro L. G. A. Coutinho (High Performance Computing Center, COPPE, Federal University of Rio de Janeiro)

What can we do if we can't touch the source code? Multi-core processors keep advancing, so how do we increase the performance of commercial and legacy software when no instrumentation or code modification is allowed? The remaining lever is where the processes run, i.e., how cores are allocated, as illustrated in the sketch below.
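To make the idea concrete: a closed-source simulator can be constrained to specific cores entirely from the outside. Below is a minimal Linux sketch (ours, not from the talk) of a launcher that pins itself to a chosen core with sched_setaffinity and then replaces itself with the unmodified binary; the binary name and arguments are whatever the user supplies.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Usage: ./pin <core-id> <command> [args...]
   Pins this process to one core, then exec's the unmodified binary. */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <core-id> <command> [args...]\n", argv[0]);
        return 1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(atoi(argv[1]), &set);
    /* Affinity is inherited across exec(), so the simulator stays pinned. */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    execvp(argv[2], &argv[2]);  /* only returns on failure */
    perror("execvp");
    return 1;
}

Each MPI rank started through such a wrapper (or through the binding options of the MPI launcher itself) lands on a known core without the simulator ever being recompiled or instrumented.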

Issues in Multicore Performance
Multi-core regime: when the memory system is saturated, the order and pattern of data accesses become a performance-determining factor. "We varied both the number of nodes and the number of cores per node, and found that the efficiency (performance as a function of total cores) was largely independent of the number of nodes and extremely dependent on cores per node."
"Core skew" effects: because each core might be in a slightly different phase of execution, different functions may be running on different cores at the same time.
"Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications", J. Diamond et al., 2011 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
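The cores-per-node sensitivity quoted above is easy to reproduce with a small memory-streaming microbenchmark. The following sketch (ours, not from the paper) times the same total amount of array traffic with a doubling number of OpenMP threads; on a bandwidth-saturated node, the measured GB/s stops improving well before the core count is exhausted.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 25)  /* 32M doubles per array: far larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    int maxt = omp_get_max_threads();  /* read once: set_num_threads changes it */
    for (int t = 1; t <= maxt; t *= 2) {
        omp_set_num_threads(t);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = a[i] + 3.0 * b[i];  /* STREAM-like triad: 3 x 8 bytes/iter */
        double dt = omp_get_wtime() - t0;
        printf("%2d threads: %.3f s (%.2f GB/s)\n",
               t, dt, 3.0 * N * sizeof(double) / dt / 1e9);
    }
    free(a);
    free(b);
    return 0;
}

Compiled with, e.g., gcc -O2 -fopenmp, this is essentially a single-statement variant of the STREAM triad; where it flattens out tells you how many cores per node the memory system can actually feed.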

An Example: Rayleigh-Bénard Convection
4:1:1 box, 501×125×125 mesh
Elements: 39,140,625
Nodes: 7,969,752
Edges: 43,833,636
Flow equations: 31,879,008 (four flow unknowns per node: three velocity components plus pressure)
Temperature equations: 7,642,824
Time steps: 2,954
EdgeCFD solver on Marte, a Dell cluster, running on 64 cores with MPI point-to-point (P2P) communication.
Elias, R. N., Camata, J. J., Aveleda, A. A., Coutinho, A. L. G. A., "Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems", Lecture Notes in Computer Science, 2011, v. 6449, pp. 306-313.

Dell cluster, 64 cores, core skew allocation
[Figures: communication graph; time spent in 10 time steps]
Similar results can be found in: Jeff Diamond, Byoung-Do Kim, Martin Burtscher, Steve Keckler, Keshav Pingali and Jim Browne, "Multicore Optimization for Ranger", TeraGrid09, http://www.teragrid.org/tg09/

Parallel Reservoir Simulator: ECLIPSE
The ECLIPSE* family of reservoir simulation software specializes in black-oil, compositional, and thermal finite-volume reservoir simulation, as well as streamline reservoir simulation. It has been on the market for more than 30 years.
ECLIPSE Black-oil simulation: three-phase, 3D reservoir simulation supporting extensive well controls, field operations planning, and comprehensive enhanced oil recovery (EOR) schemes.
"Schlumberger – Reservoir Simulation", http://www.slb.com/services/software/reseng.aspx

ECLIPSE Features (cont'd)
ECLIPSE Compositional simulation: describes reservoir fluid phase behavior and the compositional changes associated with multi-component hydrocarbon flow.
ECLIPSE FrontSim simulation: models multiphase fluid flow along streamlines, enabling better visualization of fluid flow in the reservoir.
ECLIPSE Thermal simulation: simulates a wide range of thermal recovery processes, including steam-assisted gravity drainage, toe-to-heel air injection, and cold heavy oil production with sand.

ECLIPSE "Schlumberger – Reservoir Simulation” - http://www.slb.com/services/software/reseng.aspx

New Generation of Parallel Reservoir Simulators: INTERSECT (IX)
Multistage parallel linear solver framework:
- Two-stage CPR (Constrained Pressure Residual) scheme for large-scale parallel runs
- Parallel Algebraic Multigrid (PAMG) solver with an F-GMRES outer iteration as the first-stage preconditioner, and a parallel ILU-type scheme as the second stage
SPE 96809, "Parallel Scalable Unstructured CPR-Type Linear Solver for Reservoir Simulation", H. Cao et al., SPE Annual Technical Conference and Exhibition, 9-12 October 2005, Dallas, Texas.
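In outline, a two-stage CPR preconditioner first solves a pressure subsystem (here with AMG) and then applies a cheap full-system smoother (here ILU(0)) to the residual left after the pressure correction. The sketch below is a schematic reconstruction, not INTERSECT code; restrict_to_pressure, prolong_from_pressure, amg_vcycle, residual_update, and ilu0_apply are hypothetical helpers standing in for real solver components.

#include <stdlib.h>

/* Hypothetical solver components, standing in for real ones: */
void restrict_to_pressure(const double *r, double *r_p, int n, int n_p);
void prolong_from_pressure(const double *z_p, double *z, int n_p, int n);
void amg_vcycle(const double *rhs, double *x, int n_p);
void residual_update(const double *r, const double *z, double *r2, int n); /* r2 = r - A z */
void ilu0_apply(const double *rhs, double *dx, int n);

/* Schematic two-stage CPR preconditioner application: z = M^{-1} r,
   where r is the full-system residual of length n and the pressure
   block has n_p unknowns. */
void cpr_apply(const double *r, double *z, int n, int n_p)
{
    double *r_p = malloc(n_p * sizeof *r_p);  /* pressure residual   */
    double *z_p = malloc(n_p * sizeof *z_p);  /* pressure correction */
    double *r2  = malloc(n * sizeof *r2);     /* corrected residual  */
    double *dz  = malloc(n * sizeof *dz);     /* stage-2 correction  */

    /* Stage 1: restrict to the pressure block and solve it with AMG
       (in IX this is PAMG inside an F-GMRES outer iteration). */
    restrict_to_pressure(r, r_p, n, n_p);
    amg_vcycle(r_p, z_p, n_p);
    prolong_from_pressure(z_p, z, n_p, n);

    /* Stage 2: one cheap full-system ILU(0) sweep on the residual
       left after the pressure correction, then combine. */
    residual_update(r, z, r2, n);
    ilu0_apply(r2, dz, n);
    for (int i = 0; i < n; i++)
        z[i] += dz[i];

    free(r_p); free(z_p); free(r2); free(dz);
}

The design rationale is that pressure behaves elliptically and is what AMG handles best, while the remaining hyperbolic/local coupling is adequately smoothed by an incomplete factorization.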

INTERSECT (IX) Architectural Features
- Static and dynamic load balancing on unstructured and structured grids
- Simulator architecture supports black-oil and compositional models within a general formulation
- Distribution of data among the available processors is determined by ParMETIS
- Communication between processors is based on MPI (via OOMPI)
SPE 93274, "An Extensible Architecture for Next Generation Scalable Parallel Reservoir Simulation", D. DeBaun et al., 19th SPE Reservoir Simulation Symposium, 2005, The Woodlands, Texas.
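For readers unfamiliar with graph-based data distribution, the call below shows the general shape of a ParMETIS k-way partitioning of a distributed mesh graph, assuming the ParMETIS v3 C API. The toy graph here is a 1D chain of cells split evenly across ranks, far simpler than a reservoir grid, but the CSR-plus-vtxdist layout is the same.

#include <mpi.h>
#include <parmetis.h>
#include <stdio.h>
#include <stdlib.h>

/* Partition a distributed 1D chain of cells (a toy stand-in for a
   reservoir grid graph) across all ranks with ParMETIS. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    idx_t nlocal = 4;  /* cells owned by this rank */
    idx_t *vtxdist = malloc((size + 1) * sizeof *vtxdist);
    for (int p = 0; p <= size; p++)
        vtxdist[p] = p * nlocal;

    /* Local CSR adjacency: each cell touches its chain neighbors. */
    idx_t *xadj = malloc((nlocal + 1) * sizeof *xadj);
    idx_t *adjncy = malloc(2 * nlocal * sizeof *adjncy);
    idx_t gfirst = vtxdist[rank], glast = vtxdist[size] - 1, ne = 0;
    xadj[0] = 0;
    for (idx_t i = 0; i < nlocal; i++) {
        idx_t g = gfirst + i;
        if (g > 0)     adjncy[ne++] = g - 1;
        if (g < glast) adjncy[ne++] = g + 1;
        xadj[i + 1] = ne;
    }

    idx_t wgtflag = 0, numflag = 0, ncon = 1, nparts = size;
    real_t *tpwgts = malloc(nparts * sizeof *tpwgts);
    for (idx_t p = 0; p < nparts; p++)
        tpwgts[p] = (real_t)1.0 / nparts;
    real_t ubvec = 1.05;  /* 5% load-imbalance tolerance */
    idx_t options[3] = {0, 0, 0}, edgecut;
    idx_t *part = malloc(nlocal * sizeof *part);

    ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, NULL, NULL, &wgtflag,
                         &numflag, &ncon, &nparts, tpwgts, &ubvec,
                         options, &edgecut, part, &comm);

    if (rank == 0)
        printf("edge cut: %d\n", (int)edgecut);

    free(vtxdist); free(xadj); free(adjncy); free(tpwgts); free(part);
    MPI_Finalize();
    return 0;
}

The partitioner's objective, minimizing the edge cut subject to balanced part weights, maps directly to minimizing inter-processor MPI traffic while balancing the per-processor cell count.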

INTERSECT (IX): architectural layers in the simulator system (figure).
SPE 93274, "An Extensible Architecture for Next Generation Scalable Parallel Reservoir Simulation", D. DeBaun et al., 19th SPE Reservoir Simulation Symposium, 2005, The Woodlands, Texas.

Multicore Processors
Nehalem family:
- 3-way 1333 MHz QPI memory access
- 8 MB L3 cache shared by all cores
- Smart Cache LLC allocation
- Where should an allocation of 2 cores go?
Harpertown family:
- 1333 MHz FSB memory access
- 12 MB L2 cache, shared by each pair of cores
- Where should an allocation of 2 cores go?
Multicore bottlenecks: L3 cache capacity, off-chip bandwidth, DRAM banks.
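Whether two processes share a cache is something one can check directly: on Linux, the kernel exposes cache sharing through sysfs. The short sketch below (ours, not from the talk) prints, for each core, which cores share its last-level cache; on Harpertown the cores come out in L2 pairs, while on Nehalem all cores on a socket share the L3.

#include <stdio.h>

/* Print which CPUs share a last-level cache with each core, using the
   Linux sysfs cache topology (index3 is usually the L3, index2 the L2). */
int main(void)
{
    for (int cpu = 0; ; cpu++) {
        char path[128], buf[256];
        /* Try L3 first (Nehalem has one); fall back to the shared L2
           (Harpertown has no L3). */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_list", cpu);
            f = fopen(path, "r");
        }
        if (!f)
            break;  /* no such CPU: done */
        if (fgets(buf, sizeof buf, f))
            printf("cpu%d shares its LLC with cpus %s", cpu, buf);
        fclose(f);
    }
    return 0;
}

The output answers the "where do 2 cores go?" question: placing both processes of a half-loaded node on cores that do not share a cache avoids competing for the same LLC capacity.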

Benchmarks
B1 – Benchmark 1: black-oil model; significant amounts of free gas and very thin cells; 1 million active cells.
B2 – Benchmark 2: compositional model; realistic reservoir; 2 million cells.

Benchmark B1: horizontal permeability distribution in B1 (figure).

Benchmark B1 - Output

Benchmark B2: horizontal permeability distribution in B2 (figure).

Benchmark B2 - Output

Clusters
Marte: Dell PowerEdge M1000e cluster; 16 nodes (128 cores); Intel Xeon E5450 (Harpertown); 256 GB of RAM; InfiniBand 20 Gbps (DDR) with a full Clos topology.
Galileu: Sun Blade 6048 cluster; 896 nodes (7,168 cores); Intel Xeon X5560 (Nehalem); 21 TB of RAM; InfiniBand 40 Gbps (QDR) with a 3D-torus topology.

Benchmark B1 - ECLIPSE
Both configurations use 64 MPI processes in total: full-core mode (8 cores/node) needs 8 nodes, while half-core mode (4 cores/node) needs 16 nodes.

Benchmark B2 - ECLIPSE

Benchmark B1 – INTERSECT (IX)

Benchmark B2 – INTERSECT (IX)

Conclusions
- Core skew allocation reduces execution times in all cases.
- The effects are less pronounced on Nehalem.
- On production runs of complex models, the reduction amounts to hours or days.
- However, the resulting underutilization of resources is an economic issue.
- New multi-core optimization techniques need to be incorporated into the codes.