Slide 1
High-Performance Quantum Simulation: A Challenge to the Schrödinger Equation on 256^4 Grids
Toshiyuki Imamura (1,3), with thanks to Susumu Yamada (2,3), Takuma Kano (2), and Masahiko Machida (2,3)
1. UEC (University of Electro-Communications)
2. CCSE, JAEA (Japan Atomic Energy Agency)
3. CREST, JST (Japan Science and Technology Agency)
RANMEP2008, NCTS, Taiwan (National Tsing Hua University, Hsinchu, Taiwan), Jan. 4-8, 2008
Slide 2
Outline
I. Physics: Review of Quantum Simulation
II. Mathematics: Numerical Algorithm
III. Grand Challenge: Parallel Computing on the ES
IV. Numerical Results
V. Conclusion
Slide 3
I. Physics: Review of Quantum Simulation
Slide 4
1.1 Quantum Simulation (1/2)
[Figure: down-sizing of an S-I-S junction from width W to W']
Crossover from classical to quantum?
Classical equation of motion vs. the Schrödinger equation
Slide 5
1.2 Quantum Simulation (2/2)
Numerical simulation of the coupled Schrödinger equation (α: coupling; β: 1/mass ∝ 1/W).
The state is not a value but a vector, so a numerical method to solve the equation above is needed.
H is treated by spectral expansion in its eigenvectors {u_n} (each u_n a possible state), which requires exact diagonalization of the Hamiltonian.
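The slide's equations do not survive the scrape; what follows is a minimal reconstruction of the standard spectral-expansion time evolution the bullets describe (the specific coupled-junction Hamiltonian is not recoverable here):

  \[
  i\hbar\,\frac{\partial \psi}{\partial t} = H\psi, \qquad H u_n = E_n u_n,
  \]
  \[
  \psi(t) = \sum_n c_n\, e^{-i E_n t/\hbar}\, u_n, \qquad c_n = \langle u_n \,|\, \psi(0) \rangle .
  \]

Once the lowest eigenmodes {u_n} are known, time integration reduces to multiplying each coefficient by a phase, which is why the rest of the talk centers on the eigensolver.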
Slide 6
II. Mathematics: Numerical Algorithm
Slide 7
2.1 Krylov Subspace Iteration
- Lanczos (the traditional method), i.e. Krylov basis + Gram-Schmidt: simple, but a shift-and-invert version is needed.
- LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient): a CG approach on {Krylov basis, Ritz vector, prior vector}; restarts at every iteration; inverse-free, and therefore needs less communication.
Slide 8
2.2 LOBPCG
Costly: since the whole block {X, W, P} is updated at every iteration, extra matrix-vector (MV) products are required: 3 MV per iteration instead of 1.
Other difficulties in the implementation:
- Breakdown of linear independence: we wrote our own DSYGV using an LDL^t factorization with deflation (not Cholesky).
- Growth of numerical error in {W, X, P}: the error is detected and the vectors are recalculated automatically.
- Choice of the shift.
- Portability.
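As a concrete reference, here is a minimal single-vector sketch of the LOBPCG cycle described above, using LAPACK's stock DSYGV for the small Rayleigh-Ritz problem (the production solver is a parallel block version with the custom LDL^t-plus-deflation DSYGV mentioned on the slide); the test matrix, tolerance, and Jacobi-style preconditioner are illustrative:

  ! Toy single-vector LOBPCG for the smallest eigenpair of a dense
  ! symmetric matrix; compile with a LAPACK link line such as -llapack.
  ! The 1-D Laplacian test matrix is an illustrative stand-in for the
  ! 256^4 Hamiltonian.
  program lobpcg_sketch
    implicit none
    integer, parameter :: n = 100
    real(8) :: H(n,n), x(n), w(n), p(n), r(n)
    real(8) :: S(n,3), G(3,3), M(3,3), ev(3), work(64), lambda
    integer :: i, it, nb, info

    H = 0d0                              ! 1-D Laplacian: 2 on the
    do i = 1, n                          ! diagonal, -1 off-diagonal
      H(i,i) = 2d0
      if (i < n) then
        H(i,i+1) = -1d0
        H(i+1,i) = -1d0
      end if
    end do

    call random_number(x)
    x = x / sqrt(sum(x*x))
    p = 0d0
    nb = 2                               ! basis {x, w}; {x, w, p} later

    do it = 1, 300
      lambda = dot_product(x, matmul(H, x))
      r = matmul(H, x) - lambda*x        ! residual of the Ritz pair
      if (sqrt(sum(r*r)) < 1d-8) exit
      w = r / (2d0 - lambda)             ! Jacobi-style preconditioner
      S(:,1) = x
      S(:,2) = w
      if (it > 1) then
        S(:,3) = p                       ! prior direction joins the basis
        nb = 3
      end if
      ! Rayleigh-Ritz on span(S): solve the small pencil G y = ev M y
      G(1:nb,1:nb) = matmul(transpose(S(:,1:nb)), matmul(H, S(:,1:nb)))
      M(1:nb,1:nb) = matmul(transpose(S(:,1:nb)), S(:,1:nb))
      call dsygv(1, 'V', 'U', nb, G, 3, M, 3, ev, work, 64, info)
      if (info /= 0) stop 'Gram matrix breakdown -- see slide remedy'
      p = matmul(S(:,2:nb), G(2:nb,1))   ! implicit-restart direction
      x = matmul(S(:,1:nb), G(1:nb,1))   ! best Ritz vector in the span
      x = x / sqrt(sum(x*x))
    end do
    print '(a,i4,a,f12.8)', 'iterations:', it, '   lambda:', lambda
  end program lobpcg_sketch

The info /= 0 branch is exactly where the linear-dependence breakdown on the slide shows up: the Gram matrix of {x, w, p} stops being positive definite, which the custom LDL^t-with-deflation solver tolerates while Cholesky-based DSYGV does not.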
Slide 9
2.3 Preconditioning
The preconditioner approximates the inverse Hamiltonian, T ≈ H^(-1), where
  H = A + B_1 + B_2 + B_3 + B_4 + C_12 + C_23 + C_34
with A diagonal and each A + B_x block tridiagonal. Candidate approximations:
- H ≈ A
- H ≈ A + B_1
- H ≈ (A + B_1) A^(-1) (A + B_2)
Each factor is applied through a shifted LDL^t factorization.
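A one-line check (plain algebra, not on the slide) of why the factored candidate is accurate: expanding it reproduces the first two coupling terms exactly, leaving an error that is second order in the B's,

  \[
  (A + B_1)\,A^{-1}\,(A + B_2) \;=\; A + B_1 + B_2 + B_1 A^{-1} B_2 ,
  \]

and since A is diagonal and each A + B_x is block tridiagonal, both factors remain cheap to apply via the shifted LDL^t factorization.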
Slide 10
III. Grand Challenge: Parallel Computing on the ES
Slide 11
3.2 Technical Issues on the Earth Simulator
Programming model: a hybrid of distributed parallelism and thread parallelism, i.e. 3-level parallelism (a sketch follows below):
- Inter-node: MPI (Message Passing Interface), with low latency (6.63 us) and very high bandwidth (11.63 GB/s).
- Intra-node (8 processors per node): auto-parallelization or OpenMP (thread-level parallelism).
- Vector processor (innermost loops): automatic or manual vectorization.
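A minimal sketch of how the three levels compose, here on a toy global sum (the array sizes, the cyclic distribution of k over ranks, and all names are illustrative; the real code distributes the data itself rather than replicating it):

  ! Compile with an MPI Fortran wrapper and OpenMP, e.g.
  !   mpif90 -fopenmp hybrid_sketch.f90
  program hybrid_sketch
    use mpi
    implicit none
    integer, parameter :: nk = 64, nj = 64, ni = 256
    real(8) :: v(ni,nj,nk), s, total
    integer :: i, j, k, rank, nprocs, ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    v = 1d0
    s = 0d0
    do k = 1 + rank, nk, nprocs        ! level 1, inter-node: MPI ranks
      !$omp parallel do private(i) reduction(+:s)
      do j = 1, nj                     ! level 2, intra-node: threads
        do i = 1, ni                   ! level 3: left to the vectorizer
          s = s + v(i,j,k)
        end do
      end do
      !$omp end parallel do
    end do
    call MPI_Reduce(s, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'global sum =', total   ! equals ni*nj*nk
    call MPI_Finalize(ierr)
  end program hybrid_sketch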
Slide 12
3.3 Quantum Simulation Parallel Code
Application flow chart: eigenmode calculation → time integrator → quantum state analyzer → visualization.
The parallel LOBPCG solver and the surrounding parallel code were developed on the ES; the results are visualized with AVS.
Slide 13
3.4 Handling of Huge Data
Data distribution for the 4-D array, indexed (i, j, k, l):
- (k, l): flattened and divided by N_P, the number of MPI processes (2-dimensional loop decomposition).
- j: divided by M_P, the number of microtasking processes (= 8), for intra-node parallelization (1-dimensional loop decomposition).
- i: loop length 256, reserved for vector processing.
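A sketch of the owner computation this distribution implies; the flattening, the block sizing, and the names owner_of / n_procs are illustrative, not taken from the actual code:

  program decomp_sketch
    implicit none
    integer, parameter :: n = 256
    integer :: np
    np = 512
    print *, 'rank owning (k,l) = (1,1):    ', owner_of(1, 1, np)
    print *, 'rank owning (k,l) = (256,256):', owner_of(n, n, np)
  contains
    integer function owner_of(k, l, n_procs)
      integer, intent(in) :: k, l, n_procs
      integer :: m, chunk
      m = (l - 1) * n + (k - 1)               ! flatten (k,l) to 0 .. n*n-1
      chunk = (n * n + n_procs - 1) / n_procs ! ceiling(n*n / n_procs)
      owner_of = m / chunk                    ! contiguous block ownership
    end function owner_of
  end program decomp_sketch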
Slide 14
3.5 Parallel LOBPCG
- The core of the implementation is the matrix-vector multiplication.
- The 3-level parallelism is carefully mapped onto the loop nest below.
- For inter-node parallelization, communication pipelining is used (see the sketch after the loop).
- In the Rayleigh-Ritz part, ScaLAPACK is used.

  do l = 1, 256                ! inter-node parallelism
    do k = 1, 256              ! inter-node parallelism
      do j = 1, 256            ! intra-node (thread) parallelism
        do i = 1, 256          ! vectorization
          w(i,j,k,l) = a(i,j,k,l)*v(i,j,k,l)       &
                     + b*( v(i+1,j,k,l) + ... )    &
                     + c*( v(i+1,j+1,k,l) + ... )
        end do
      end do
    end do
  end do
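The slide names communication pipelining but does not show it; a plausible reading is the standard overlap of ghost-plane exchange with interior computation, sketched here in one dimension (buffer names, tags, and the stencil are illustrative, not from the ES code):

  program overlap_sketch
    use mpi
    implicit none
    integer, parameter :: n = 256
    real(8) :: v(n), w(n), ghost_lo, ghost_hi
    integer :: rank, nprocs, lo, hi, req(4), ierr, i

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    lo = rank - 1; if (lo < 0) lo = MPI_PROC_NULL
    hi = rank + 1; if (hi >= nprocs) hi = MPI_PROC_NULL

    v = real(rank, 8)
    ghost_lo = 0d0
    ghost_hi = 0d0
    ! start the ghost exchange, then compute while messages fly
    call MPI_Irecv(ghost_lo, 1, MPI_DOUBLE_PRECISION, lo, 0, MPI_COMM_WORLD, req(1), ierr)
    call MPI_Irecv(ghost_hi, 1, MPI_DOUBLE_PRECISION, hi, 1, MPI_COMM_WORLD, req(2), ierr)
    call MPI_Isend(v(n), 1, MPI_DOUBLE_PRECISION, hi, 0, MPI_COMM_WORLD, req(3), ierr)
    call MPI_Isend(v(1), 1, MPI_DOUBLE_PRECISION, lo, 1, MPI_COMM_WORLD, req(4), ierr)

    do i = 2, n - 1                        ! interior: needs no ghosts
      w(i) = -2d0*v(i) + v(i-1) + v(i+1)
    end do
    call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
    w(1) = -2d0*v(1) + ghost_lo + v(2)     ! boundary points last
    w(n) = -2d0*v(n) + v(n-1) + ghost_hi
    if (rank == 0) print *, 'w(1) =', w(1)
    call MPI_Finalize(ierr)
  end program overlap_sketch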
Slide 15
IV. Numerical Results
Slide 16
4.1 Numerical Results
Preliminary test of our eigensolver on a 4-junction system (a 256^4-dimensional problem).

Performance (5 eigenmodes):
  CPUs   Time [s]   TFLOPS
  2048   3118       3.65
  3072   2535       4.49
  4096   1621       7.02

[Figure: convergence history (10 eigenmodes)]
Slide 17
4.2 Numerical Results (Scenario)
Initial state: the potential is changed at only a single junction; the junctions interact through capacitive coupling.
Question: synchronization, or independence (localization)?
The simplest case is two junctions. Discretization: 256 grid points per junction, so the 4-junction system is 256^4.
Slide 18
4.3 Numerical Results
Two stacked intrinsic Josephson junctions.
Classical regime: independent dynamics. Quantum regime: ?
Slide 19
[Figure: two-junction snapshots in the (q1, q2) plane at t = 0.0, 2.9, 9.2, and 10.0 (a.u.); α = 0.4, β = 0.2]
Slide 20
[Figure: two-junction snapshots in the (q1, q2) plane at t = 0.0, 2.5, 4.2, and 10.0 (a.u.); α = 0.4, β = 1.0]
Slide 21
Two junctions:
- Weakly quantum (classical): independence.
- Strongly quantum: synchronization.
Slide 22
Three Junctions
Slide 23
[Figure: three-junction dynamics; α = 0.4, β = 0.2]
Slide 24
[Figure: three-junction dynamics; α = 0.4, β = 1.0]
Slide 25
Four junctions: quantum-assisted synchronization.
[Figure: expectation values <q1>..<q4> versus t (a.u.); panel (a): α = 0.4, β = 0.2; panel (b): α = 0.4, β = 1.0]
Slide 26
V. Conclusion
Slide 27
5. Conclusion
- Collective MQT (macroscopic quantum tunneling) in intrinsic Josephson junctions, studied via parallel computing on the ES.
- Direct quantum simulation (4 junctions): quantum behavior is synchronous, classical behavior is localized; quantum-assisted synchronization.
- High-performance computing: the novel LOBPCG eigenvalue algorithm with a communication-free (or communication-light) implementation; sustained 7 TFLOPS (21.4% of peak).
- Toward peta-scale computing?
28
Thank you! 謝謝 Further information Physics: machida.masahiko@jaea.go.jp machida.masahiko@jaea.go.jp HPC: imamura@im.uec.ac.jp imamura@im.uec.ac.jp