High-Performance Quantum Simulation: A Challenge to the Schrödinger Equation on 256^4 Grids. Toshiyuki Imamura (今村俊幸), with thanks to Susumu Yamada, Takuma Kano, and Masahiko Machida.

Presentation transcript:

High-Performance Quantum Simulation: A Challenge to the Schrödinger Equation on 256^4 Grids
Toshiyuki Imamura 1,3 (今村俊幸), with thanks to Susumu Yamada 2,3, Takuma Kano 2, and Masahiko Machida 2,3
1. UEC (University of Electro-Communications, 電気通信大学)  2. CCSE, JAEA (Japan Atomic Energy Agency)  3. CREST, JST (Japan Science and Technology Agency)
RANMEP2008, Jan. 4-8, NCTS, Hsinchu, Taiwan (清華大学)

Outline
I. Physics: Review of Quantum Simulation
II. Mathematics: Numerical Algorithm
III. Grand Challenge: Parallel Computing on the Earth Simulator (ES)
IV. Numerical Results
V. Conclusion

I. Physics, Review of Quantum Simulation, etc.

1.1 Quantum Simulation (1/2): Crossover from Classical to Quantum?
As the junction stack (layers labelled S / I / S in the figure) is down-sized from width W to W', the appropriate description crosses over from the classical equation of motion to the Schrödinger equation. [Figure: down-sizing of the junction stack.]

1.2 Quantum Simulation (2/2): Numerical Simulation of the Coupled Schrödinger Equation
- α: coupling constant
- β: 1/mass, proportional to 1/W
- H: Hamiltonian, handled by a spectral expansion in its eigenvectors {u_n}
- Ψ: the possible state; not a single value but a vector
Solving the equation requires exact diagonalization of the Hamiltonian, i.e., a numerical method for the eigenvalue problem.
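For readers of the transcript, a schematic form of the equations implied here (written in units with ħ = 1); the single-junction potential V and the coupling term C are not spelled out on the slide, so they are left generic as an illustrative assumption only:

  \[
    i\,\frac{\partial \Psi}{\partial t} = H\,\Psi, \qquad
    H = \sum_{j}\Bigl(-\beta\,\frac{\partial^{2}}{\partial q_j^{2}} + V(q_j)\Bigr)
      + \alpha \sum_{j} C(q_j, q_{j+1}),
  \]
  \[
    H u_n = E_n u_n, \qquad
    \Psi(t) = \sum_n c_n\, e^{-i E_n t}\, u_n, \qquad c_n = (u_n, \Psi(0)).
  \]

With the spectral expansion, time evolution reduces to computing the low-lying eigenpairs (E_n, u_n), which is why exact diagonalization is the computational core of the simulation.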

II. Mathematics, Numerical Algorithm, etc.

2.1 Krylov Subspace Iteration
- Lanczos (traditional method): Krylov basis + Gram-Schmidt (GS). Simple, but a shift-and-invert version is needed.
- LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient): works on {Krylov basis, Ritz vector, prior vector}, a CG-type approach. Restart at every iteration; INVERSE-free, hence less communication.
[Figure: comparison of the LOBPCG and Lanczos iterations.]
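As a side note for the transcript, the two iterations can be written compactly as follows (standard textbook formulations, not copied from the slide; T denotes the preconditioner of Section 2.3, and the single-vector form is shown, the block version applying the same recipe to the blocks {X, W, P}):

  \[
    \text{Lanczos:}\quad x^{(m)} \in \mathcal{K}_m(H, x^{(0)})
      = \operatorname{span}\{x^{(0)}, Hx^{(0)}, \dots, H^{m-1}x^{(0)}\},
  \]
  \[
    \text{LOBPCG:}\quad x^{(k+1)} = \operatorname*{arg\,min}_{x \in \operatorname{span}\{x^{(k)},\, w^{(k)},\, p^{(k)}\}}
      \frac{(x, Hx)}{(x, x)}, \qquad
    w^{(k)} = T\bigl(Hx^{(k)} - \rho^{(k)} x^{(k)}\bigr),
  \]

where ρ^(k) is the current Rayleigh quotient and p^(k) the previous search direction. No inverse of H (or of a shifted H) appears, in contrast to the shift-and-invert Lanczos variant.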

2.2 LOBPCG
- Costly: since the block is updated at every iteration, the matrix-vector (MV) products must be recomputed as well (1 MV versus 3 MV per iteration).
Other difficulties in the implementation:
- Breakdown of linear independence: we make our own DSYGV using an LDL^t factorization with deflation (not Cholesky).
- Growth of numerical error in {W, X, P}: detect the numerical error and recalculate the vectors automatically.
- Choice of the shift.
- Portability.
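For context, the Rayleigh-Ritz step that the custom DSYGV serves can be sketched as follows (a standard formulation, not verbatim from the slide). With the block basis S = [X, W, P], each iteration solves the small generalized eigenproblem

  \[
    (S^{T} H S)\, y \;=\; \theta\, (S^{T} S)\, y .
  \]

LAPACK's DSYGV reduces such a problem through a Cholesky factorization of the right-hand matrix S^T S, which breaks down when the columns of S become nearly linearly dependent; an LDL^t factorization with deflation of the near-singular directions, as stated on the slide, keeps the step well defined.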

2.3 Preconditioning
The preconditioner approximates the inverse Hamiltonian, T ~ H^{-1}, where

  H = A + B_1 + B_2 + B_3 + B_4 + C_{12} + C_{23} + C_{34}.

Candidate approximations:
  H ~ A
  H ~ A + B_1
  H ~ (A + B_1) A^{-1} (A + B_2)
Here A is diagonal and each A + B_x is block-tridiagonal; a shifted LDL^t factorization is used.
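A minimal sketch of the simplest choice above, T ~ (A - σI)^{-1} with A diagonal; the routine name, array names, and the scalar shift handling are illustrative, not the actual ES code:

  ! Apply the diagonal preconditioner w = (A - sigma*I)^{-1} r,
  ! i.e. the H ~ A approximation combined with a scalar shift sigma.
  subroutine precond_diag(n, a_diag, sigma, r, w)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(in)  :: a_diag(n)   ! diagonal entries of A
    real(8), intent(in)  :: sigma       ! shift
    real(8), intent(in)  :: r(n)        ! residual vector
    real(8), intent(out) :: w(n)        ! preconditioned residual
    integer :: i
    do i = 1, n
       w(i) = r(i) / (a_diag(i) - sigma)
    end do
  end subroutine precond_diag

The block-tridiagonal variants (A + B_x) would replace the elementwise division by a forward/backward solve with the shifted LDL^t factors.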

III. Grand Challenge, Parallel Computing on ES, etc.

3.2 Technical Issues on the Earth Simulator
Programming model: a hybrid of distributed-memory parallelism, thread parallelism, and vector processing (3-level parallelism).
- Inter-node: MPI (Message Passing Interface); low latency (6.63 us), very high bandwidth (11.63 GB/s).
- Intra-node (8 processors per node): auto-parallelization or OpenMP (thread-level parallelism).
- Vector processor (innermost loops): automatic or manual vectorization.
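An illustrative skeleton of the three-level model (MPI across nodes, OpenMP threads inside a node, a vectorizable innermost loop); the program, array names, and the reduction it computes are placeholders, not the actual simulation code:

  program hybrid_skeleton
    use mpi
    implicit none
    integer, parameter :: ni = 256, nj = 256
    integer :: ierr, rank, nprocs, i, j
    real(8) :: v(ni, nj), w(ni, nj), local_sum, global_sum

    call MPI_Init(ierr)                                   ! level 1: inter-node (MPI)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    v = 1.0d0
    local_sum = 0.0d0

  !$omp parallel do private(i) reduction(+:local_sum)     ! level 2: intra-node threads
    do j = 1, nj
       do i = 1, ni                                       ! level 3: vectorized inner loop
          w(i, j) = 2.0d0 * v(i, j)
          local_sum = local_sum + w(i, j)
       end do
    end do
  !$omp end parallel do

    call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'global sum =', global_sum
    call MPI_Finalize(ierr)
  end program hybrid_skeleton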

3.3 Quantum Simulation Parallel Code
Application flow chart: eigenmode calculation (the parallel LOBPCG solver developed on the ES) -> time integrator -> quantum state analyzer -> visualization (with AVS). The whole chain runs as a parallel code on the ES.

3.4 Handling of Huge Data
Data distribution for the 4D array indexed by (i, j, k, l):
- (k, l): two-dimensional loop decomposition across the N_P MPI processes ((k, l) / N_P).
- j: one-dimensional loop decomposition across the M_P microtasking processes (j / M_P) for intra-node parallelization.
- i: loop length 256, handled by vector processing.
N_P: number of MPI processes; M_P: number of microtasking processes (= 8).
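A sketch of how such index ranges could be computed (block distribution of the flattened (k, l) plane over MPI ranks and of j over threads); the routine and its arguments are illustrative, and the real code may decompose the indices differently:

  ! Local index ranges for one MPI rank and one thread.
  subroutine local_ranges(rank, np, tid, mp, n, kl_lo, kl_hi, j_lo, j_hi)
    implicit none
    integer, intent(in)  :: rank, np      ! MPI rank, number of MPI processes (N_P)
    integer, intent(in)  :: tid, mp       ! thread id, number of threads (M_P = 8)
    integer, intent(in)  :: n             ! grid size per dimension (= 256)
    integer, intent(out) :: kl_lo, kl_hi  ! local range of the flattened (k, l) index
    integer, intent(out) :: j_lo, j_hi    ! local range of j for this thread
    integer :: chunk_kl, chunk_j
    chunk_kl = (n*n + np - 1) / np
    kl_lo    = rank * chunk_kl + 1
    kl_hi    = min((rank + 1) * chunk_kl, n*n)
    chunk_j  = (n + mp - 1) / mp
    j_lo     = tid * chunk_j + 1
    j_hi     = min((tid + 1) * chunk_j, n)
  end subroutine local_ranges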

3.5 Parallel LOBPCG
- The core operation is the MATRIX-VECTOR multiplication.
- The 3-level parallelism is carefully applied in our implementation.
- In the inter-node parallelization, communication pipelining is used.
- In the Rayleigh-Ritz part, ScaLAPACK is used.
The MV kernel maps the four loops of the 256^4 grid onto the three levels:

  do l = 1, 256            ! inter-node parallelism
    do k = 1, 256          ! inter-node parallelism
      do j = 1, 256        ! intra-node (thread) parallelism
        do i = 1, 256      ! vectorization
          w(i,j,k,l) = a(i,j,k,l)*v(i,j,k,l)   &
                     + b*(v(i+1,j,k,l) + ...)  &
                     + c*(v(i+1,j+1,k,l) + ...)
        enddo
      enddo
    enddo
  enddo
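The slide mentions communication pipelining for the inter-node part but does not show it. A generic sketch of the idea, overlapping a nonblocking halo exchange with the interior computation on a 1-D slice (neighbour ranks, tags, stencil coefficients, and buffer layout are illustrative assumptions, not the ES implementation):

  ! Overlap halo exchange with interior work (illustrative only).
  subroutine mv_pipelined(comm, left, right, nloc, v, w)
    use mpi
    implicit none
    integer, intent(in)    :: comm, left, right, nloc
    real(8), intent(inout) :: v(0:nloc+1)   ! local slice with one halo cell per side
    real(8), intent(out)   :: w(nloc)
    integer :: req(4), ierr, i
    ! post the nonblocking halo exchange with the neighbouring ranks
    call MPI_Irecv(v(0),      1, MPI_DOUBLE_PRECISION, left,  0, comm, req(1), ierr)
    call MPI_Irecv(v(nloc+1), 1, MPI_DOUBLE_PRECISION, right, 1, comm, req(2), ierr)
    call MPI_Isend(v(1),      1, MPI_DOUBLE_PRECISION, left,  1, comm, req(3), ierr)
    call MPI_Isend(v(nloc),   1, MPI_DOUBLE_PRECISION, right, 0, comm, req(4), ierr)
    ! interior points need no halo data: compute them while the messages are in flight
    do i = 2, nloc - 1
       w(i) = 2.0d0*v(i) - v(i-1) - v(i+1)
    end do
    call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
    ! boundary points use the freshly received halo values
    w(1)    = 2.0d0*v(1)    - v(0)      - v(2)
    w(nloc) = 2.0d0*v(nloc) - v(nloc-1) - v(nloc+1)
  end subroutine mv_pipelined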

IV. Numerical Results

4.1 Numerical Results: Preliminary Test of Our Eigensolver
4-junction system -> a 256^4-dimensional eigenproblem.
[Table: performance for 5 eigenmodes; columns CPUs, time [s], TFLOPS.]
[Figure: convergence history for 10 eigenmodes.]

4.2 Numerical Results (Scenario)
Initial state: the potential is changed for only a single junction; the junctions interact through capacitive coupling.
Question: synchronization or independence (localization)?
The simplest case: two junctions. Discretization: 256 grid points per junction, up to the 4-junction system.

4.3 Numerical Results: Two-Stacked Intrinsic Josephson Junction
Classical regime: independent dynamics. Quantum regime: ?

[Figure: snapshots in the (q1, q2) plane at t = 0.0, 2.9, 9.2, and 10.0 (a.u.) for α = 0.4, β = 0.2.]

[Figure: snapshots in the (q1, q2) plane at t = 0.0, 2.5, 4.2, and 10.0 (a.u.) for α = 0.4, β = 1.0.]

Two junctions: weakly quantum (classical) regime -> independence; strongly quantum regime -> synchronization.

Three Junctions

[Figure: three-junction case, α = 0.4, β = 0.2.]

[Figure: three-junction case, α = 0.4, β = 1.0.]

4 Junctions: Quantum-Assisted Synchronization
[Figure: time traces of <q1>, <q2>, <q3>, <q4> versus t (a.u.): (a) α = 0.4, β = 0.2; (b) α = 0.4, β = 1.0.]

V. Conclusion

5. Conclusion
- Collective MQT in intrinsic Josephson junctions, studied via parallel computing on the ES:
  - direct quantum simulation (up to 4 junctions);
  - quantum (synchronous) vs. classical (localized) behaviour;
  - quantum-assisted synchronization.
- High-performance computing:
  - novel eigenvalue algorithm: LOBPCG;
  - communication-free (or communication-reduced) implementation;
  - sustained 7 TFLOPS (21.4% of peak).
- Toward peta-scale computing?

Thank you! 謝謝 Further information Physics: HPC: