1
Parallelizing Legacy Applications in Message Passing Programming Model and the Example of MOPAC
Tseng-Hui (Frank) Lin
thlin@npac.syr.edu, tsenglin@us.ibm.com
2
Legacy Applications
Perform functions that are still useful
Large user population
Big money already invested
Rewriting is expensive
Rewriting is risky
Changed over a long period of time
Modified by different people
Historical code
–Dead code
–Old concepts
–Major bugs fixed
3
What Legacy Applications Need
Provide higher resolution
Run bigger data sets
Graphical representation of scientific data
Keep certified
4
How to Meet the Requirements
Improve performance: parallel computing
Keep certified: change only the critical parts
Better user interface: add a GUI
5
System Configuration
6
Distributed vs. Shared Memory
7
Message Passing Programming
Non-parallelizable parts
–Data dependences force sequential execution
–Not worth parallelizing
Workload distribution
Input data distribution
Distributed computation
–Load balance
Results collection
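A minimal sketch of this pattern in C with MPI. The workload, the compute_chunk() routine, and the assumption that n divides evenly across ranks are illustrative, not MOPAC code: the root distributes the input data, every rank computes its share, and the results are collected back.

```c
#include <mpi.h>
#include <stdlib.h>

/* Placeholder for a parallelized loop body; stands in for real per-element work. */
static double compute_chunk(double x) { return x * x; }

int main(int argc, char **argv)
{
    int rank, size;
    const int n = 1024;                         /* total workload; assumed divisible by size */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = n / size;
    double *input = NULL, *output = NULL;
    double *local_in  = malloc(chunk * sizeof(double));
    double *local_out = malloc(chunk * sizeof(double));

    if (rank == 0) {                            /* input data lives on the root at first */
        input  = malloc(n * sizeof(double));
        output = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) input[i] = (double)i;
    }

    /* Input data distribution. */
    MPI_Scatter(input, chunk, MPI_DOUBLE, local_in, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Distributed computation; equal chunks give a simple static load balance. */
    for (int i = 0; i < chunk; i++)
        local_out[i] = compute_chunk(local_in[i]);

    /* Results collection back on the root. */
    MPI_Gather(local_out, chunk, MPI_DOUBLE, output, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    free(local_in); free(local_out); free(input); free(output);
    return 0;
}
```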
8
Non-Parallelizable Parts: Amdahl's Law
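The formula itself did not survive transcription; the standard statement of Amdahl's law, with f the non-parallelizable fraction of the work and p the number of processors, is

$$S(p) = \frac{1}{f + \frac{1-f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{f}.$$

For example, with f = 0.05 and p = 16 the speed-up is about 9.1, and no number of processors can push it past 20.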
9
MOPAC
Semi-empirical molecular orbital package: MNDO, MINDO/3, AM1, PM3
MOPAC 3 submitted to QCPE in 1985
MOPAC 6 ported to many platforms
–VMS
–UNIX (our work is based on this version)
–DOS/Windows
MOPAC 7 is the current version
10
MOPAC input file
L1 : UHF PULAY MINDO3 VECTORS DENSITY LOCAL T=300
L2 : EXAMPLE OF DATA FOR MOPAC
L3 : MINDO/3 UHF CLOSED-SHELL D2D ETHYLENE
L41: C
L42: C 1.400118 1
L43: H 1.098326 1 123.572063 1
L44: H 1.098326 1 123.572063 1 180.000000 0 2 1 3
L45: H 1.098326 1 123.572063 1 90.000000 0 1 2 3
L46: H 1.098326 1 123.572063 1 270.000000 0 1 2 3
L5 : (blank line)
Annotations: L1 holds the keywords; L2–L3 are the title and comment lines; L41–L46 give the molecule structure as a Z-matrix (internal coordinates); the blank line L5 marks end-of-data.
11
Hartree-Fock Self-Consistent Field
Schrödinger equation
Matrix equation form (Eq. 3.1.3)
Matrix representation of the Fock matrix (Eq. 3.1.4)
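The equations themselves were images and are missing from the transcript. As a hedged reconstruction, the standard forms being referred to are the Schrödinger equation and the Roothaan matrix equation,

$$\hat{H}\Psi = E\Psi, \qquad FC = SC\varepsilon,$$

where F is the Fock matrix, S the overlap matrix, C the molecular-orbital coefficient matrix, and ε the diagonal matrix of orbital energies. In the closed-shell case the Fock matrix elements in the AO basis take the form

$$F_{\mu\nu} = H^{\mathrm{core}}_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma}\Big[(\mu\nu|\lambda\sigma) - \tfrac{1}{2}(\mu\lambda|\nu\sigma)\Big],$$

with P the density matrix; the thesis equations (3.1.3) and (3.1.4) may use a different notation.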
12
HF-SCF Procedure
S1: Calculate the molecular integrals, O(n^4)
S2: Guess an initial eigenvector matrix C
S3: Use C to compute the Fock matrix F, O(n^4)
S4: Transform F to an orthogonal basis, O(n^3), and diagonalize F to get a new C, O(n^3)
S5: Stop if C has converged
S6: Guess a new C and go to S3
13
MOPAC Computation
Ab initio HF-SCF
–Evaluates all integrals rigorously
–High accuracy
–Requires high computing power
–Limited molecule size
Semi-empirical HF-SCF
–Uses the same procedure
–Reduced computational complexity
–Supports larger molecules
14
Semi-empirical SCF
Ignore some integrals
Replace integrals with experimental results
Assume the AO basis is orthogonal
S1, S3: O(n^4) => O(n^2)
S4: orthogonalization not needed
New bottleneck: diagonalization, complexity O(n^3)
15
Parallelization Procedure
Sequential analysis
–Time profiling analysis
–Program flow analysis
–Computational complexity analysis
Parallel analysis
–Data dependence resolution
–Loop parallelization
Integration
–Communication between modules
16
Sequential Analysis
Time profiling analysis
–Picks out the computationally intensive parts
–Usually run with smaller input data
Program flow analysis
–Verifies that the chosen parts are commonly used
–No domain expert required
Computational complexity analysis
–Workload distribution changes significantly with data size
17
MOPAC Sequential Analysis
Assume the complexity of the remaining part is O(n^2)
18
Loop Parallelization
Scalar forward substitution: remove temporary variables
Induction variable substitution: resolve dependences
Loop interchange/merge: enlarge granularity, reduce synchronization
Scalar expansion: resolve data dependences on scalars (see the sketch below)
Variable copying: resolve data dependences on arrays
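A minimal illustration of one of these transformations, scalar expansion, in generic C (not MOPAC code):

```c
#include <stdlib.h>

/* Before: every iteration writes and then reads the shared scalar t, a
   loop-carried dependence that blocks parallel execution of the loop. */
void square_sum_serial(const double *a, const double *b, double *c, int n)
{
    double t;
    for (int i = 0; i < n; i++) {
        t = a[i] + b[i];
        c[i] = t * t;
    }
}

/* After scalar expansion: t gets per-iteration storage, so the iterations are
   independent and can be distributed across processes or threads. */
void square_sum_expanded(const double *a, const double *b, double *c, int n)
{
    double *t = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {
        t[i] = a[i] + b[i];
        c[i] = t[i] * t[i];
    }
    free(t);
}
```

When the scalar's value is not needed after the loop, simply declaring it inside the loop body (privatization) achieves the same effect without the extra array.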
19
MOPAC Parallelization: DENSIT
Function: compute the density matrix
Two 1-level loops inside a 2-level loop
Triangular computational space
Merge the outer 2-level loop into one loop with range [1..n(n+1)/2] (see the sketch below)
Lower computation/communication ratio when n is small, so it benefits from low-latency communication
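A minimal sketch in C with MPI of merging the triangular loop into a single index range and dealing the work out across ranks. The size n, the density_element() placeholder, and the cyclic distribution are illustrative assumptions, not the actual DENSIT code:

```c
#include <mpi.h>
#include <stdlib.h>

/* Placeholder for the real DENSIT loop body (a sum over occupied orbitals). */
static double density_element(int i, int j) { return 1.0 / (1.0 + i + j); }

int main(int argc, char **argv)
{
    int rank, size;
    const int n = 200;                                /* number of basis functions (example) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int len = n * (n + 1) / 2;                        /* merged loop range [0, n(n+1)/2) */
    double *p_local = calloc(len, sizeof(double));
    double *p       = calloc(len, sizeof(double));

    int k = 0;
    for (int i = 0; i < n; i++)                       /* original 2-level triangular loop ... */
        for (int j = 0; j <= i; j++, k++)             /* ... merged into the single index k   */
            if (k % size == rank)                     /* cyclic distribution across ranks     */
                p_local[k] = density_element(i, j);

    /* Collect the pieces so every rank holds the whole packed lower triangle. */
    MPI_Allreduce(p_local, p, len, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    free(p_local); free(p);
    return 0;
}
```

Cyclic assignment over the merged index keeps the triangular workload roughly balanced; the single collective at the end is where the low computation/communication ratio hurts when n is small.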
20
MOPAC Parallelization: DIAG
Part 1: generate the Fock molecular orbital (FMO) matrix
–Higher computation/communication ratio
–Find the global maximum TINY from the local maxima (see the sketch below)
–The FMO matrix must be redistributed for Part 2
Part 2: 2x2 rotations to eliminate significant off-diagonal elements
–The "if" structure causes load imbalance
–The innermost loop must be interchanged outward
–Some calculations are run on all nodes to save communication
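One small piece of Part 1, reducing per-node local maxima to a global maximum, might look like the following in C with MPI; the function and variable names are illustrative, not the actual DIAG identifiers:

```c
#include <mpi.h>
#include <math.h>

/* Each rank scans its own slice of matrix elements for the largest magnitude,
   then one MPI_Allreduce(MAX) gives every rank the same global value. */
double global_max_magnitude(const double *local_elems, int local_count)
{
    double local_max = 0.0, global_max = 0.0;
    for (int i = 0; i < local_count; i++)
        if (fabs(local_elems[i]) > local_max)
            local_max = fabs(local_elems[i]);

    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return global_max;
}
```

The reduced value ends up on every rank, so the threshold test that follows needs no further communication.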
21
MOPAC Parallelization: HQRII
Function: standard eigensolver
Candidates taken from R. J. Allen's survey of parallel eigensolvers
Use the pdspevx() function from PNNL's PeIGS library
Use the MPI communication library
Exchanges data in small chunks; performs well when n/p > 8
PeIGS is implemented in C and packs the matrix differently (row major)
22
Integration
23
Communication between Modules
Parallel part – sequential part
–Use TCP/IP
–Automatically upgrade to shared memory when possible
Sequential part – user interface
–Input and output files
–Application/Advanced Visualization System (AVS) remote-module communication
User interface – display
–AVS
24
MOPAC Control Panel & Module
25
MOPAC GUI
26
Data Files and Platforms
Platforms:
–SGI Power Challenge
–IBM SP2
27
DENSIT Speed-up
28
DENSIT Speed-up (charts: Power Challenge, SP2)
29
DIAG Speed-up
30
DIAG Speed-up (charts: Power Challenge, SP2)
31
HQRII Speed-up
32
HQRII Speed-up (charts: Power Challenge, SP2)
33
Overall Speed-up: projected, assuming the sequential part is O(n^2) (charts: Power Challenge, SP2)
34
Overall Speed-up: assuming the non-parallelizable part is O(1) and O(n^2)
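A hedged sketch of the reasoning behind such a projection (a and b are unknown constants, not values from the thesis): if the non-parallelizable work grows as a·n^2 while the parallelized work grows as b·n^3, the serial fraction entering Amdahl's law is

$$f(n) = \frac{a n^2}{a n^2 + b n^3} = \frac{1}{1 + (b/a)\,n},$$

which shrinks as the problem size n grows, so the attainable overall speed-up 1/f(n) improves for larger molecules; an O(1) non-parallelizable part shrinks the fraction even faster.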
35
Related Work: IBM
Application: conformational search
Focus: throughput
36
Related Work: SDSC
Focus: performance
Parallelized:
–Evaluation of electronic repulsion integrals
–Calculation of first and second derivatives
–Solution of the eigensystem
Platform: 64-node iPSC/860
Results:
–Geometry optimization: speed-up = 5.2
–Vibration analysis: speed-up = 40.8
37
Achievements
Parallelized a legacy application from a computer-science perspective
Kept the code validated
Established performance analysis procedures
Predicted performance on large data sets
Optimized the parallel code
Improved performance
Improved the user interface
38
Future Work
Shared-memory model
Web-based user interface
Dynamic node allocation
Parallelization of subroutines with lower computational complexity