Experience of Porting and Optimization of Seismic Modelling on Multi and Many Cores of Hybrid Computing Cluster (We P4 14)
R. Rastogi, A. Srivastava, K. Sirasala, H. Chavhan, K. Khonde

Introduction

Seismic modelling is a technique for simulating the seismic response of a given geological subsurface model and shot-receiver geometry. It is based on a finite-difference solution of the second-order wave equation. Until the last decade, finite-difference based seismic modelling applications scaled fairly well on single-processor parallel clusters using MPI. With the advent of accelerators such as Nvidia GPUs and many-core coprocessors such as the Intel Xeon Phi, the MPI-only programming model became inefficient in terms of performance, and hybrid programming models together with various optimizations are needed to recover it. In this paper, we report our experience of porting and optimizing a legacy 2D acoustic modelling application on the hybrid architecture of PARAM Yuva II. The application solves the seismic acoustic wave equation using a finite-difference method that is second-order accurate in time and fourth-order accurate in space. The initial application was MPI based and used a domain decomposition approach for parallelization. The optimization and porting details on Xeon and Xeon Phi, along with a comparative performance study, are presented here.

Methodology

Initially, the application used MPI for parallelization, dividing the computation among processors through domain decomposition. The following steps were taken to run the application on the hybrid architecture and to compare its performance on the multi-core Xeon cluster and the many-core Xeon Phi coprocessor:

1. The application was profiled to identify hotspots. Analysis of a run with 2×2 domain decomposition showed that the wave-propagation function is the most compute-intensive part of the application: computation accounts for 80% and MPI communication for 17% of the total compute time.
2. OpenMP was introduced at the wave-propagation loop to achieve data decomposition across the cores of each Xeon node (the discretization and a sketch of this hybrid kernel are shown after this list).
3. Various optimization techniques were applied to enhance the performance of the application.
4. The optimized application was ported to the many-core Xeon Phi in native and symmetric modes.
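For orientation, the discretization described in the introduction (second-order accurate in time, fourth-order in space) has the standard explicit O(2,4) form below for the 2D constant-density acoustic wave equation on a square grid of spacing h. This is the textbook stencil, given as an assumption about the general scheme rather than a quotation of the authors' exact formulation:

\[
p_{i,j}^{n+1} = 2p_{i,j}^{n} - p_{i,j}^{n-1}
  + \frac{c_{i,j}^{2}\,\Delta t^{2}}{12h^{2}}\Big(
    -p_{i+2,j}^{n} + 16p_{i+1,j}^{n} - 30p_{i,j}^{n} + 16p_{i-1,j}^{n} - p_{i-2,j}^{n}
    - p_{i,j+2}^{n} + 16p_{i,j+1}^{n} - 30p_{i,j}^{n} + 16p_{i,j-1}^{n} - p_{i,j-2}^{n}
  \Big)
\]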
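A minimal sketch of step 2, layering OpenMP over the existing MPI domain decomposition at the wave-propagation loop, follows. This is an illustrative reconstruction under stated assumptions (the array names, row-major layout, and precomputed coefficient array are ours), not the authors' code; halo exchange between MPI ranks for the two-cell stencil border is assumed to happen outside this kernel.

    /* Hypothetical sketch: one time step of the O(2,4) acoustic update on a
       local MPI subdomain, threaded with OpenMP. p0/p1/p2 are the previous,
       current and next wavefields; coef holds c^2*dt^2/(12*h^2) per cell.
       MPI halo exchange for the stencil border happens elsewhere. */
    #include <omp.h>

    void propagate(int nx, int nz,
                   const float *restrict p0, const float *restrict p1,
                   float *restrict p2, const float *restrict coef)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 2; i < nx - 2; i++) {          /* x: one strip per thread */
            for (int j = 2; j < nz - 2; j++) {      /* z: unit stride, vectorizable */
                int k = i * nz + j;
                float lap =
                    -p1[k + 2*nz] + 16.0f*p1[k + nz] - 30.0f*p1[k]
                    + 16.0f*p1[k - nz] - p1[k - 2*nz]   /* d2p/dx2 (times 12h^2) */
                    - p1[k + 2] + 16.0f*p1[k + 1] - 30.0f*p1[k]
                    + 16.0f*p1[k - 1] - p1[k - 2];      /* d2p/dz2 (times 12h^2) */
                p2[k] = 2.0f*p1[k] - p0[k] + coef[k]*lap;
            }
        }
    }

Threading only the outer loop keeps the parallel work coarse-grained while leaving the unit-stride inner loop to the compiler's vectorizer, which matters on both Xeon and the 512-bit vector units of Xeon Phi.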
Optimizations on Xeon

Compute time of the application on Xeon as augmentative optimizations are applied, and the resulting relative speedup.

Porting on Xeon Phi

Compute time of the application on Xeon Phi in native and symmetric modes.

Scalability and Efficiency on Xeon and Xeon Phi

Scalability and efficiency on Xeon and Xeon Phi, before and after optimization.

The System – PARAM Yuva II

PARAM Yuva II is a hybrid computing cluster with a peak performance of 520.4 TF. Each node has two Intel Xeon E5-2670 (Sandy Bridge) processors and Intel Xeon Phi 5110P coprocessors.

Application Outcome

Seismic modelling outcome: (a) input velocity model, (b) modelling parameters, (c) synthetic seismogram for a single shot location, and (d) wave-propagation snapshots at 0.1 s, 0.2 s, 0.3 s and 0.4 s.

Conclusions

Optimization and porting of a legacy finite-difference seismic modelling application on Xeon and Xeon Phi were successfully demonstrated on PARAM Yuva II. A 5.5x speedup was achieved on Xeon through the optimizations. Before optimization, Xeon outperformed Xeon Phi; after optimization, the compute time of Xeon Phi in native mode was comparable to that of a single Xeon node for the 2×8 domain decomposition. The maximum parallel efficiency achieved was 46% on Xeon and 8% on Xeon Phi, and it does not improve with further increases in domain decomposition. Compute times are reported for different symmetric and native executions on Xeon Phi; in symmetric mode, compute times were comparable to, and for a few domain decompositions better than, those on Xeon. As seismic modelling is a key kernel for advanced applications such as reverse time migration (RTM) and full waveform inversion (FWI), further exploration of performance gains for such applications on similar hardware platforms is warranted.

Acknowledgements

The authors thank the Centre for Development of Advanced Computing (C-DAC), Pune, for permission to publish this work, and are grateful to Mr Arvind Amin of Intel for his expert advice and initial guidance.