Parallel Iterative Solvers for Ill-Conditioned Problems with Reordering. Kengo Nakajima, Department of Earth & Planetary Science, The University of Tokyo.

Presentation transcript:

Parallel Iterative Solvers for Ill-Conditioned Problems with Reordering
Kengo Nakajima, Department of Earth & Planetary Science, The University of Tokyo

Overview
- Parallel preconditioned iterative solvers for FEM-type applications.
- Optimization on the Earth Simulator (ES) and other SMP-cluster-type architectures.
- Programming models: flat MPI and OpenMP/MPI hybrid.
- Reordering for parallel/vector processing: multicoloring (MC), reverse Cuthill-McKee (RCM), and cyclic multicoloring on RCM (CM-RCM).

Effect of the Number of Colors
- Basically, convergence is faster if the number of colors is larger.
- But a larger color number means a smaller vector length, hence poor performance on vector processors, and higher synchronization overhead with OpenMP.
- DJDS provides data locality even as the color number increases.
- On scalar processors, performance may therefore improve as the color number increases (for both flat MPI and hybrid).

[Figure: effect of color number and matrix storage (DJDS vs. DCRS) for the PGA model, one SMP node with OpenMP; orderings MC, RCM, CM-RCM; platforms: ES, Hitachi SR11000, IBM SP3.]

Matrix Storage
- DJDS: Descending-order Jagged Diagonal Storage
- DCRS: Descending-order Compressed Row Storage

DCRS matrix-vector product (row-wise reduction loop):

  do i= 1, N
    k1= index(i-1) + 1
    k2= index(i)
    do k= k1, k2
      kk= item(k)
      Y(i)= Y(i) + A(k)*X(kk)
    enddo
  enddo

DJDS matrix-vector product (long innermost loop):

  do j= 1, NCON
    do i= 1, NN(j)
      k = index(j-1) + i
      kk= item(k)
      Y(i)= Y(i) + A(k)*X(kk)
    enddo
  enddo

- DJDS, with its long innermost loops, is suitable for vector processors.
- The reduction-type loop of DCRS is more suitable for cache-based scalar processors because of its localized operations.

Well-Conditioned Problems
- For well-conditioned problems, the difference between MC, RCM, and CM-RCM, and the effect of the color number, are not significant.
- Test case: 3D linear-elastic cube with uniform meshes (10^6 nodes, 3x10^6 DOF); single SMP node with OpenMP, DJDS.

[Figure: effect of reordering on ES for the 3D linear-elastic problem (one SMP node, OpenMP); DJDS vs. DCRS vs. no reordering; MC, CM-RCM, RCM; platforms: ES, Hitachi SR11000.]

Ill-Conditioned Problems
- For ill-conditioned problems with complicated geometries, however, the effect is significant.
- Test case: complicated PGA (Pin Grid Array) model with 61 pins; 956,128 elements, 1,012,354 nodes (3,037,062 DOF); single SMP node with OpenMP, DJDS.
- The number of independent sets for RCM is 2,985; the minimum number of colors for independent CM-RCM is 2,381. Max. ratio: 30:1.

[Figure: convergence and performance for the PGA model with MC, CM-RCM, and RCM; platforms: ES, Hitachi SR11000.]

Summary
- CM-RCM provides the most robust and efficient convergence on both vector and scalar processors.
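Why the color count matters at all: after multicolor reordering, rows that share a color have no direct coupling, so a Gauss-Seidel or ILU-type sweep can run in parallel within each color while the colors themselves are processed sequentially. The following is a minimal sketch of such a colored forward sweep in the style of the loops above; it is an illustration, not code from the talk. The names NCOLORtot, COLORindex, D, and B are assumed here (color boundaries after renumbering, the diagonal, and the right-hand side), with index/item/A holding only off-diagonal entries as in the DCRS snippet.

  !-- Sketch: multicolor Gauss-Seidel sweep (hypothetical array names).
  !   Rows are renumbered so that rows of color ic occupy
  !   COLORindex(ic-1)+1 .. COLORindex(ic); rows sharing a color are
  !   mutually independent, so the loop over i is safe to parallelize.
  do ic= 1, NCOLORtot                          ! colors: strictly sequential
!$omp parallel do private(i,k,SW)
    do i= COLORindex(ic-1)+1, COLORindex(ic)   ! rows of one color: parallel
      SW= B(i)
      do k= index(i-1)+1, index(i)             ! off-diagonals of row i (DCRS)
        SW= SW - A(k)*X(item(k))               ! item(k) lies in another color
      enddo
      X(i)= SW / D(i)                          ! D(i): diagonal of row i
    enddo
!$omp end parallel do
  enddo

This structure makes the trade-off on the slides concrete: a larger number of colors keeps the sweep closer to the sequential ordering (better convergence), but it shortens the parallel loop over i, which reduces the vector length on the ES and adds one OpenMP synchronization per color.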