Parallelization for a Block-Tridiagonal System with MPI
Jungpyo Lee, Plasma Science & Fusion Center (PSFC), MIT
2009 Spring 18.337 Term Project

1. Motivation
2D RF wave analysis in plasma for tokamak operation with TORIC, an MPI Fortran based code that uses FEM for the Maxwell equations in plasma.
[Figure: TORIC solution at 240 N_r x 255 N_m, showing the FW, IBW, and ICW wave fields; J. Wright, PSFC, PoP, 2004]

2. Block-Tridiagonal System
A block-tridiagonal equation along the radial direction,
A_i x_{i-1} + B_i x_i + C_i x_{i+1} = d_i,   i = 1, ..., N_r,
where each block carries the poloidal components (block size 6N_m x 6N_m) and the unknowns x_i are the electric fields.
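To make the block structure concrete, the following NumPy sketch (all sizes and block values are illustrative, not TORIC's) builds a small system of this form and assembles the dense matrix once as a reference:

```python
import numpy as np

# Illustrative sizes only: Nr radial points, blocks of size 6*Nm (poloidal components)
Nr, Nm = 8, 2
nb = 6 * Nm

rng = np.random.default_rng(0)
A = [rng.standard_normal((nb, nb)) for _ in range(Nr)]  # lower blocks A_i (A[0] unused)
B = [rng.standard_normal((nb, nb)) + 3 * nb * np.eye(nb) for _ in range(Nr)]  # diagonal blocks B_i (kept well conditioned)
C = [rng.standard_normal((nb, nb)) for _ in range(Nr)]  # upper blocks C_i (C[-1] unused)
d = [rng.standard_normal(nb) for _ in range(Nr)]        # right-hand-side blocks d_i

# Assemble the dense matrix only to show the block-tridiagonal pattern
M = np.zeros((Nr * nb, Nr * nb))
for i in range(Nr):
    M[i*nb:(i+1)*nb, i*nb:(i+1)*nb] = B[i]
    if i > 0:
        M[i*nb:(i+1)*nb, (i-1)*nb:i*nb] = A[i]
    if i < Nr - 1:
        M[i*nb:(i+1)*nb, (i+1)*nb:(i+2)*nb] = C[i]

x_ref = np.linalg.solve(M, np.concatenate(d))  # dense reference solution
```

The block lists A, B, C, d built here are reused by the solver sketches that follow.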

2.1. Current Version of TORIC: Radially Serial Calculation for the Block-Tridiagonal System
Serial computation in the radial direction (i = 1:270): Thomas algorithm.
Parallel computation in the poloidal direction (m = 0:255): ScaLAPACK matrix operations on a BLACS process grid.
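For reference, the radially serial sweep can be written as a block Thomas algorithm. This is a generic NumPy sketch, not TORIC's Fortran implementation; it assumes block lists A, B, C, d like those in the previous sketch:

```python
import numpy as np

def block_thomas(A, B, C, d):
    """Serial block Thomas algorithm for A[i] x[i-1] + B[i] x[i] + C[i] x[i+1] = d[i].

    A[0] and C[-1] are ignored. Returns the list of solution blocks x[i].
    """
    n = len(B)
    Bp = [B[0].copy()]  # modified diagonal blocks
    dp = [d[0].copy()]  # modified right-hand sides
    # Forward elimination along the radial index (inherently serial)
    for i in range(1, n):
        W = A[i] @ np.linalg.inv(Bp[i - 1])
        Bp.append(B[i] - W @ C[i - 1])
        dp.append(d[i] - W @ dp[i - 1])
    # Back substitution, again one radial row at a time
    x = [None] * n
    x[-1] = np.linalg.solve(Bp[-1], dp[-1])
    for i in range(n - 2, -1, -1):
        x[i] = np.linalg.solve(Bp[i], dp[i] - C[i] @ x[i + 1])
    return x

# x = block_thomas(A, B, C, d) should match x_ref from the previous sketch
```

Only the dense algebra inside each step (the inverse, multiplies, and solves on 6N_m x 6N_m blocks) is distributed by ScaLAPACK over the poloidal processor grid; the loop over i remains serial, which is the limitation addressed next.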

2.2. The Need to Parallelize the Radial Direction as well as the Poloidal Direction
Example: N_i = 270, N_m = 32, N_proc = 400.
Current scheme, serial (radial) + parallel (poloidal): time ~ 270 * (32^2 / 400), with a 2D processor distribution (20 x 20).
1) If N_proc >> N_m^2, the full processor count cannot be used (saturation).
2) Communication time increases as the block size per processor decreases.
Goal, parallel (radial) + parallel (poloidal): time ~ (270/4) * (32^2 / 100), with a 3D processor distribution (4 x 10 x 10). A rough reading of these estimates is sketched below.
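Evaluating the slide's numbers (my arithmetic, not from the slides), the ideal operation counts of the two layouts coincide at 400 processors, so the benefit of the 3D layout lies in points 1) and 2) above: no saturation limit and smaller per-processor blocks.

```python
# Numbers from the slide: Ni radial points, Nm poloidal modes, Nproc processors
Ni, Nm, Nproc = 270, 32, 400

# Current scheme: all Ni radial steps serial, poloidal work over a 20 x 20 grid
t_2d = Ni * (Nm**2 / Nproc)      # ~ 691 (arbitrary units)

# Goal: radial direction split into 4 groups, each with a 10 x 10 grid (4*10*10 = 400)
t_3d = (Ni / 4) * (Nm**2 / 100)  # ~ 691, the same ideal count

# The 2D layout, however, stops scaling once Nproc grows past roughly Nm**2,
# whereas the 3D layout can keep absorbing processors along the radial direction.
print(t_2d, t_3d)
```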

2.3. Use of BLACS for a 3D Processor Grid
Why a 3D grid is needed:
1) to remove the saturation of the speedup, and
2) to divide the large data of one block (6N_m x 6N_m) across the memory of many processors.
A context array in BLACS is used to build the 3D processor grid; an index-mapping sketch of such a grid follows.
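As a language-neutral illustration of the layout only (a minimal sketch, not the solver's BLACS/Fortran code), the processor pool can be factored into P1 radial groups, each owning its own P2 x P3 2D grid; in the solver each such group would correspond to one entry of the BLACS context array.

```python
# Hypothetical factorization matching the slide's example: 4 radial groups of 10 x 10
P1, P2, P3 = 4, 10, 10

def grid_coords(rank):
    """Map a flat MPI-style rank to (radial group, 2D-grid row, 2D-grid column)."""
    group, rem = divmod(rank, P2 * P3)
    row, col = divmod(rem, P3)
    return group, row, col

def group_ranks(group):
    """All ranks belonging to one radial group, i.e. one 2D process grid."""
    base = group * P2 * P3
    return list(range(base, base + P2 * P3))

print(grid_coords(123))  # -> (1, 2, 3): rank 123 sits in group 1, at row 2, column 3
```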

2.4. Algorithms Comparison (1)
Comparison of the computation time of typical algorithms for tridiagonal systems: H. S. Stone, ACM Transactions on Mathematical Software, Vol. 1 (1975); H. H. Wang, ACM Transactions on Mathematical Software, Vol. 7 (1981).

2.4. Algorithms Comparison (2)
Theoretical estimate of the computation time for the three algorithms (with the maximum limited to a value set from experience).
The Thomas algorithm is faster below a threshold (P = 2^8).
There exists an optimal choice of P1.

3. Implementation (1)
Use an algorithm, suggested by Garaud, that combines the merits of the divide-and-conquer method and the odd-even cyclic algorithm (P. Garaud, Mon. Not. R. Astron. Soc., 391 (2008)).
Step 1. Serial forward reduction within each divided group.

3. Implementation (2)
Step 2. Pass the blocks in the last lines and redistribute them into tridiagonal form.
Step 3. Odd-even cyclic reduction for the blocks in the first lines of all groups.

3. Implementation (3)
Step 4. Cyclic back substitution in the first lines of all groups.
Step 5. Serial back substitution within each group.
A serial sketch of the odd-even cyclic reduction behind Steps 3 and 4 is given below.
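As a companion to Steps 3 and 4, here is a serial NumPy emulation of odd-even cyclic reduction for a block-tridiagonal system. It sketches the numerical scheme only; in the actual solver the rows live on different MPI ranks, and the eliminations within each reduction level are independent and therefore run in parallel.

```python
import numpy as np

def block_cyclic_reduction(A, B, C, d):
    """Odd-even cyclic reduction for A[i] x[i-1] + B[i] x[i] + C[i] x[i+1] = d[i].

    A[0] and C[-1] are ignored. Returns the list of solution blocks x[i].
    """
    n = len(B)
    if n == 1:
        return [np.linalg.solve(B[0], d[0])]
    zero = np.zeros_like(B[0])
    rA, rB, rC, rd = [], [], [], []  # reduced system on the even-indexed rows
    for i in range(0, n, 2):
        Ai, Bi, Ci, di = zero, B[i].copy(), zero, d[i].copy()
        if i > 0:  # eliminate x[i-1] using row i-1
            alpha = A[i] @ np.linalg.inv(B[i - 1])
            Bi -= alpha @ C[i - 1]
            di -= alpha @ d[i - 1]
            Ai = -alpha @ A[i - 1]       # now couples to x[i-2]
        if i < n - 1:  # eliminate x[i+1] using row i+1
            gamma = C[i] @ np.linalg.inv(B[i + 1])
            Bi -= gamma @ A[i + 1]
            di -= gamma @ d[i + 1]
            if i + 1 < n - 1:
                Ci = -gamma @ C[i + 1]   # now couples to x[i+2]
        rA.append(Ai)
        rB.append(Bi)
        rC.append(Ci)
        rd.append(di)
    xe = block_cyclic_reduction(rA, rB, rC, rd)  # recurse on the halved system
    # Back substitution for the odd-indexed rows (the Step 4 analogue)
    x = [None] * n
    for k, i in enumerate(range(0, n, 2)):
        x[i] = xe[k]
    for i in range(1, n, 2):
        rhs = d[i] - A[i] @ x[i - 1]
        if i < n - 1:
            rhs = rhs - C[i] @ x[i + 1]
        x[i] = np.linalg.solve(B[i], rhs)
    return x
```

Applied to the block lists from the earlier sketches, block_cyclic_reduction(A, B, C, d) should reproduce x_ref up to round-off.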

4. Results (1): Fast Computation Speed of the New Solver
When only P1 of the 3D grid is used (e.g. [P1, P2, P3] = [7, 1, 1] or [255, 1, 1]), the new solver is about two times faster than the old solver, and the saturation of the speedup is delayed.

4. Results (2): Good Stability and Accuracy of the New Solver
The electric fields computed by the new solver agree with those of the old solver to within 0.1% error.
The variance of the results with respect to the number of processors is about 50 times smaller than with the old solver.

5. Conclusions and Future Work
A parallel solver for block-tridiagonal systems was implemented, using an algorithm that combines divide-and-conquer with odd-even cyclic reduction.
The new solver is about two times faster and gives more precise results.
Development of the solver for the full 3-dimensional grid is ongoing, to overcome the saturation of the speedup.
The ratio of the 3D grid dimensions remains to be optimized.

6. Questions and Suggestions