Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load
Javier Cuenca, Domingo Giménez, José González, Jack Dongarra, Kenneth Roche
Optimisation of Linear Algebra Routines
Traditional method: hand-optimisation for each platform
› Time-consuming
› Incompatible with hardware evolution
› Incompatible with changes in the system (architecture and basic libraries)
› Unsuitable for systems with variable load
› Misuse by non-expert users
Solutions to this situation?
Some groups and projects: ATLAS, GrADS, LAWRA, FLAME, I-LIB.
But the problem is very complex.
Our Approach
Modelling the Linear Algebra Routine (LAR): T_exec = f(SP, AP, n)
› SP: System Parameters
› AP: Algorithmic Parameters
› n: Problem size
Phases:
› DESIGN: modelling the LAR
› INSTALLATION: estimation of SP
› RUN-TIME: selection of AP values, execution of LAR
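Our Approach: overview of the process
DESIGN
› Modelling the LAR (input: LAR; output: MODEL)
› Implementation of SP-Estimators (output: SP-Estimators)
INSTALLATION
› Estimation of Static-SP (inputs: SP-Estimators, Basic Libraries, Installation-File; output: Static-SP-File)
RUN-TIME
› Call to NWS (output: NWS Information)
› Dynamic Adjustment of SP (inputs: Static-SP-File, NWS Information; output: Current-SP)
› Selection of Optimum AP (inputs: MODEL, Current-SP; output: Optimum-AP)
› Execution of LAR (with the Optimum-AP)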
Our Approach
LARs:
› Jacobi methods for the symmetric eigenvalue problem
› Gauss elimination
› LU factorisation
› QR factorisation
Platforms:
› Cluster of Workstations
› Cluster of PCs
› SGI Origin 2000
› IBM SP2
Static Model of LAR: situation of the platform at installation time.
Dynamic Model of LAR: situation of the platform at run-time.
DESIGN PROCESS
LAR: Linear Algebra Routine, made by the LAR Designer.
Example of LAR: Parallel Block LU factorisation.
Modelling the LAR
T_exec = f(SP, AP, n)
› SP: System Parameters
› AP: Algorithmic Parameters
› n: Problem size
Made by the LAR Designer, only once per LAR.
Modelling the LAR
LAR: Parallel Block LU factorisation
› SP: k_3, k_2, t_s, t_w
› AP: p, b
› n: Problem size
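The model formula itself is not preserved on this slide. As an illustration only, a function of the form T_exec = f(SP, AP, n) for a block LU could be sketched as below. The cost terms are a simplified, hypothetical model and not the authors' one; the meanings given to k_3, k_2 and t_s in the comments are assumptions, and the names `SystemParams` and `t_exec` are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParams:
    k3: float  # assumed: seconds per flop in matrix-matrix (level-3) kernels
    k2: float  # assumed: seconds per flop in matrix-vector (level-2) kernels
    ts: float  # assumed: start-up time of a message, in seconds
    tw: float  # word sending time, in seconds per word


def t_exec(sp, p, b, n):
    """Hypothetical f(SP, AP, n) for a parallel block LU with AP = (p, b)."""
    steps = n / b                                # number of block steps
    # Trailing-matrix updates (about 2n^3/3 flops), shared by p processors.
    arith = (2.0 * n**3 / (3.0 * p)) * sp.k3
    # Panel factorisations (about n^2 * b / 2 flops), treated as not parallelised.
    arith += (n**2 * b / 2.0) * sp.k2
    comm = 0.0
    if p > 1:
        # Simplistic assumption: two messages and one n-by-b panel per step.
        comm = steps * (2.0 * sp.ts + (n * b) * sp.tw)
    return arith + comm
```

With p = 1 the communication term vanishes, so the sketch reduces to a sequential cost that depends only on k_3, k_2, b and n.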
Implementation of SP-Estimators
Estimators of Arithmetic-SP:
› Computation kernel of the LAR
› Similar storage scheme
› Similar quantity of data
Estimators of Communication-SP:
› Communication kernel of the LAR
› Similar kind of communication
› Similar quantity of data
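A minimal sketch of an arithmetic-SP estimator, assuming that k_3 is the time per floating-point operation of a matrix-matrix kernel applied to b x b blocks. The kernel here is a plain Python triple loop standing in for the real computation kernel (for example a BLAS call), and `estimate_k3` is a hypothetical name.

```python
import time


def matmul_block(a, b_mat, c, b):
    """Level-3-style kernel: C += A * B on b x b blocks stored as lists of rows."""
    for i in range(b):
        for j in range(b):
            s = 0.0
            for k in range(b):
                s += a[i][k] * b_mat[k][j]
            c[i][j] += s


def estimate_k3(b, repeats=3):
    """Time the kernel for block size b and return seconds per flop."""
    a = [[1.0] * b for _ in range(b)]
    bm = [[2.0] * b for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        c = [[0.0] * b for _ in range(b)]
        t0 = time.perf_counter()
        matmul_block(a, bm, c, b)
        best = min(best, time.perf_counter() - t0)
    # A b x b matrix product performs 2 * b^3 floating-point operations.
    return best / (2.0 * b**3)
```

Taking the best of several repetitions is a design choice of this sketch, meant to reduce the effect of transient load while the estimator runs.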
INSTALLATION PROCESS
› Only once per platform
› Done by the System Manager
Estimation of Static-SP
Basic Libraries:
› Basic Communication Library: MPI, PVM
› Basic Linear Algebra Library: reference BLAS, machine-specific BLAS, ATLAS
Installation File: the SP values are obtained using the information (n and AP values) in this file.
Estimation of Static-SP
Platform: Cluster of Pentium III + Fast Ethernet
Basic Libraries: ATLAS and MPI
[Table: t_w-static (in sec) as a function of message size (Kbytes); values not recoverable]
[Table: k_3-static (in sec) as a function of block size; values not recoverable]
RUN-TIME PROCESS
RUN-TIME PROCESS: Static approach
› Selection of Optimum AP (output: Optimum-AP)
LU on IBM SP2
Quotient between the execution time obtained with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors.
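The quotient described here can be computed as in the sketch below. The timings are hypothetical placeholders, not data from the slides, and `quotient_to_optimum` is a name chosen for the example.

```python
def quotient_to_optimum(measured, model_choice):
    """Ratio of the time obtained with the model's AP to the best measured time.

    `measured` maps each tried AP, here a (p, b) pair, to a time in seconds.
    """
    t_opt = min(measured.values())
    return measured[model_choice] / t_opt


# Hypothetical timings for three block sizes on 4 processors.
measured = {(4, 16): 12.5, (4, 32): 10.0, (4, 64): 11.0}
q = quotient_to_optimum(measured, (4, 64))   # 11.0 / 10.0
deviation_percent = (q - 1.0) * 100.0        # deviation from the optimum, in %
```

A quotient of 1 means the model selected the same parameters as an exhaustive experimental search.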
RUN-TIME PROCESS: Static approach
› Execution of LAR
[Table for p = 4: surviving column labels are n, opt, MODEL, dev (in %) and Static; values not recoverable]
RUN-TIME PROCESS: Static approach
[Table for p = 8: surviving column labels are n, opt, MODEL, dev (in %) and Static; values not recoverable]
RUN-TIME PROCESS: Dynamic approach
Call to NWS
The NWS (Network Weather Service) is called and it reports:
› the fraction of available CPU (f_CPU)
› the current word sending time (t_w-current) for specific n and AP values (n_0, AP_0)
Then the fraction of available network is calculated: [formula not recoverable]
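The formula for the fraction of available network did not survive on this slide. One plausible reading, stated here as an assumption, is the ratio between the word sending time measured at installation and the one currently reported, both taken for the same (n_0, AP_0). The function name is hypothetical and no NWS API is used.

```python
def fraction_of_network(tw_static, tw_current):
    """Assumed definition: f_network = t_w-static / t_w-current, capped at 1."""
    if tw_static <= 0.0 or tw_current <= 0.0:
        raise ValueError("word sending times must be positive")
    return min(1.0, tw_static / tw_current)


# Example: 0.7 taken as the installation-time value and 0.8 as the current one.
f_network = fraction_of_network(0.7, 0.8)
```

The cap at 1 is a choice of this sketch: a network that is currently faster than at installation time is treated as fully available.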
Dynamic Adjustment of SP
Inputs: Static-SP-File, NWS Information. Output: Current-SP.
The values of the SP are adjusted according to the current situation: [formulas not recoverable]
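The adjustment formulas are missing from this slide. A minimal sketch, assuming that the arithmetic parameters scale inversely with the fraction of available CPU and the word sending time scales inversely with the fraction of available network, could look as follows. Leaving t_s unchanged is a further assumption, and `SystemParams` and `adjust_sp` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SystemParams:
    k3: float  # arithmetic parameter (level-3 kernels)
    k2: float  # arithmetic parameter (level-2 kernels)
    ts: float  # message start-up time
    tw: float  # word sending time


def adjust_sp(static, f_cpu, f_network):
    """Derive Current-SP from Static-SP and the fractions reported at run-time."""
    if not (0.0 < f_cpu <= 1.0 and 0.0 < f_network <= 1.0):
        raise ValueError("fractions must lie in (0, 1]")
    return replace(
        static,
        k3=static.k3 / f_cpu,      # less CPU available: slower arithmetic
        k2=static.k2 / f_cpu,
        tw=static.tw / f_network,  # less network available: slower transfers
    )
```

For example, with half of the CPU available the sketch doubles k_3 and k_2, so the model predicts arithmetic that takes twice as long.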
Selection of Optimum AP
Input: Current-SP (from the Dynamic Adjustment of SP). Output: Optimum-AP.
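The slides do not say how the optimum AP is searched for. A simple possibility, shown here as an assumption, is to evaluate the model for every candidate (p, b) pair and keep the one with the smallest predicted time. `toy_model` is a made-up stand-in for T_exec = f(SP, AP, n), and the candidate lists are illustrative.

```python
from itertools import product


def select_optimum_ap(model, sp, n, procs, block_sizes):
    """Return the (p, b) pair that minimises the predicted execution time."""
    return min(product(procs, block_sizes),
               key=lambda ap: model(sp, ap[0], ap[1], n))


def toy_model(sp, p, b, n):
    """Hypothetical model: parallel arithmetic, messages, and panel work."""
    return sp["k3"] * n**3 / p + sp["ts"] * p * n / b + sp["k2"] * n * n * b


current_sp = {"k3": 1e-9, "k2": 1e-9, "ts": 1e-4}
best_p, best_b = select_optimum_ap(toy_model, current_sp, 1000,
                                   procs=(1, 2, 4, 8), block_sizes=(16, 32, 64))
```

Because the search uses the Current-SP, a change in load changes the predicted times and may change the selected number of processors or block size.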
Execution of LAR
The LAR is executed with the Optimum-AP.
Platform load: different situations studied

Situation  Nodes          CPU avail.  t_w-current
A          node1-node8    100%        0.7 sec
B          node1-node4    80%         0.8 sec
           node5-node8    100%        0.7 sec
C          node1-node4    60%         1.8 sec
           node5-node8    100%        0.7 sec
D          node1-node4    60%         1.8 sec
           node5-node6    100%        0.7 sec
           node7-node8    80%         0.8 sec
E          node1-node4    60%         1.8 sec
           node5-node6    100%        0.7 sec
           node7-node8    50%         4.0 sec
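Optimum AP for the different situations studied
[Table: optimum block size for each problem size n under platform load situations A to E; values not recoverable]
[Table: number of nodes to use, p = r × c, for each problem size n under platform load situations A to E; extracted cell values, with row and column boundaries lost: 24 22 22 2 2 24 22 22 22 24 22 2 2 22 1]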
Experimental Time: deviations from the Optimum
Conclusions and Future Work
› The proposed methodology is viable in systems where the load is either stable or variable.
› Software like NWS is suitable for adjusting the values of the system parameters obtained at installation time.
› The heterogeneous load case offers many more possibilities than the one studied.