Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load. Javier Cuenca, Domingo Giménez, José González, Jack Dongarra, Kenneth Roche.


Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load. Javier Cuenca, Domingo Giménez, José González, Jack Dongarra, Kenneth Roche

Optimisation of Linear Algebra Routines

Traditional method: hand-optimisation for each platform.
- Time-consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for systems with variable load
- Misused by non-expert users

Solutions to this situation? Several groups and projects address it: ATLAS, GrADS, LAWRA, FLAME, I-LIB. But the problem is very complex.

Our Approach

Modelling the Linear Algebra Routine (LAR): T_exec = f(SP, AP, n)
- SP: System Parameters
- AP: Algorithmic Parameters
- n: problem size
The process runs through three stages, DESIGN, INSTALLATION and RUN-TIME: estimation of SP, selection of AP values, execution of the LAR.

Our Approach (complete scheme)

- DESIGN: Modelling the LAR produces the MODEL; Implementation of SP-Estimators produces the SP-Estimators.
- INSTALLATION: Estimation of Static-SP, using the Basic Libraries and the Installation-File, produces the Static-SP-File.
- RUN-TIME: Call to NWS yields NWS Information; Dynamic Adjustment of SP yields the Current-SP; Selection of Optimum AP yields the Optimum-AP; Execution of LAR.

Our Approach

LARs: Jacobi methods for the symmetric eigenvalue problem, Gauss elimination, LU factorisation, QR factorisation.
Platforms: cluster of workstations, cluster of PCs, SGI Origin 2000, IBM SP2.
Static model of the LAR: situation of the platform at installation time.
Dynamic model of the LAR: situation of the platform at run-time.

DESIGN PROCESS

LAR: Linear Algebra Routine, made by the LAR designer.
Example of LAR: parallel block LU factorisation.

Modelling the LAR (DESIGN)

T_exec = f(SP, AP, n)
- SP: System Parameters
- AP: Algorithmic Parameters
- n: problem size
Made by the LAR designer, only once per LAR.

Modelling the LAR (DESIGN)

LAR: parallel block LU factorisation.
- SP: k3, k2, ts, tw
- AP: p, b
- n: problem size
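The slide names the model's inputs but not its closed form. A model of this shape can be coded directly; the cost terms below are only a plausible sketch of a parallel block LU model, not the exact expression from the paper:

```python
def lu_time_model(n, p, b, k3, k2, ts, tw):
    """Illustrative T_exec = f(SP, AP, n) for a parallel block LU.

    SP (system parameters): k3, k2 -- cost per flop of the BLAS-3 and
    BLAS-2 kernels (seconds); ts, tw -- message start-up time and
    per-word sending time (seconds).
    AP (algorithmic parameters): p -- number of processes, b -- block size.
    n is the problem size.  The individual terms are assumptions made
    for illustration, since the slide only lists the parameters.
    """
    flops3 = (2.0 * n**3) / 3.0          # dominant BLAS-3 work
    flops2 = n * n * b                   # lower-order blocked work
    arithmetic = k3 * flops3 / p + k2 * flops2
    messages = n / b                     # one communication round per block column
    volume = n * b                       # words moved per round (illustrative)
    communication = messages * (ts + tw * volume)
    return arithmetic + communication
```

Once such a model exists, the run-time system only has to plug in the current SP values and minimise over the AP values (p, b).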

Implementation of SP-Estimators (DESIGN)

Estimators of Arithmetic-SP: built from the computation kernel of the LAR, with a similar storage scheme and a similar quantity of data.
Estimators of Communication-SP: built from the communication kernel of the LAR, with a similar kind of communication and a similar quantity of data.
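An arithmetic-SP estimator can be as simple as timing the routine's computation kernel on data shaped like what the LAR will actually see. A minimal sketch for k3, assuming the kernel is a b×b matrix multiplication (roughly 2·b³ flops); in practice the installed BLAS/ATLAS kernel would be timed rather than a pure-Python loop:

```python
import time

def estimate_k3(b=64, reps=3):
    """Estimate k3 (seconds per flop of the BLAS-3-like kernel) by
    timing a b x b matrix multiplication, about 2*b**3 flops, on data
    stored the way the LAR stores it.  Takes the best of `reps` runs
    to reduce timing noise."""
    A = [[1.0] * b for _ in range(b)]
    B = [[1.0] * b for _ in range(b)]
    best = float("inf")
    for _ in range(reps):
        C = [[0.0] * b for _ in range(b)]
        t0 = time.perf_counter()
        for i in range(b):
            Ai, Ci = A[i], C[i]
            for k in range(b):
                aik, Bk = Ai[k], B[k]
                for j in range(b):
                    Ci[j] += aik * Bk[j]
        best = min(best, time.perf_counter() - t0)
    return best / (2.0 * b**3)
```

A communication-SP estimator follows the same pattern: a ping-pong exchange with the LAR's message sizes, timed to yield ts and tw.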

INSTALLATION PROCESS

Done by the system manager, only once per platform.

Estimation of Static-SP (INSTALLATION)

Basic Libraries:
- basic communication library: MPI, PVM
- basic linear algebra library: reference BLAS, machine-specific BLAS, ATLAS
Installation-File: the SP values are obtained using the information (n and AP values) in this file.
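The installation step just walks the (n, AP) combinations listed in the Installation-File, runs the SP-estimators, and records the results in the Static-SP-File. A minimal sketch; the file names, JSON format, and `estimators` interface are hypothetical, since the slide does not specify them:

```python
import json

def install(installation_file, estimators, static_sp_file="static_sp.json"):
    """For each (n, AP) entry in the installation file, run the
    SP-estimators and store the results, producing the Static-SP-File
    consulted at run time.  `estimators` maps an SP name to a callable
    taking (n, ap).  File format is illustrative."""
    with open(installation_file) as f:
        entries = json.load(f)          # e.g. [{"n": 1024, "b": 32}, ...]
    table = []
    for entry in entries:
        n = entry["n"]
        ap = {k: v for k, v in entry.items() if k != "n"}
        sp = {name: est(n, ap) for name, est in estimators.items()}
        table.append({"n": n, "ap": ap, "sp": sp})
    with open(static_sp_file, "w") as f:
        json.dump(table, f, indent=2)
    return table
```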

Estimation of the Static-SP. [Figures] t_w-static (in μsec) against message size (Kbytes), and k3-static (in μsec) against block size. Platform: cluster of Pentium III + Fast Ethernet. Basic libraries: ATLAS and MPI.

RUN-TIME PROCESS: Static Approach

Selection of Optimum AP produces the Optimum-AP.

[Figure] Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors. LU on IBM SP2.

RUN-TIME PROCESS: Static Approach

Selection of Optimum AP produces the Optimum-AP, then the LAR is executed.

[Table] p = 4: percentage deviation from the optimum execution time of the static approach and of the MODEL, for each problem size n.

[Table] p = 8: percentage deviation from the optimum execution time of the static approach and of the MODEL, for each problem size n.

RUN-TIME PROCESS: Dynamic Approach

Call to NWS (RUN-TIME)

The NWS (Network Weather Service) is called, and it reports:
- the fraction of available CPU (f_CPU)
- the current word-sending time (t_w-current) for specific n and AP values (n0, AP0)
Then the fraction of available network is calculated.
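The formula for the network fraction did not survive the transcript. A natural reading, and it is only an assumption here, is the ratio between the installation-time and the current word-sending times for the same (n0, AP0):

```python
def available_network_fraction(tw_static, tw_current):
    """Assumed definition: f_network = t_w-static / t_w-current,
    both measured for the same n0 and AP0.  Equals 1.0 on an
    unloaded network and drops below 1.0 as the network gets
    loaded (sending a word takes longer)."""
    return tw_static / tw_current
```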

Dynamic Adjustment of SP (RUN-TIME)

The values of the SP are adjusted according to the current situation, using the Static-SP-File and the NWS information to produce the Current-SP.
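The adjustment formulas themselves are also missing from the transcript. A hedged sketch of the obvious scaling, static values divided by the measured availability fractions; the exact rule in the paper may differ:

```python
def adjust_sp(static_sp, f_cpu, tw_current):
    """Turn Static-SP into Current-SP using the NWS report.
    Assumed rule: arithmetic parameters are inflated by the CPU
    availability, t_w is replaced by the measured current value,
    and t_s is inflated by the same network factor.  `static_sp`
    holds the installation-time values k3, k2, ts, tw."""
    f_net = static_sp["tw"] / tw_current     # fraction of available network
    return {
        "k3": static_sp["k3"] / f_cpu,
        "k2": static_sp["k2"] / f_cpu,
        "ts": static_sp["ts"] / f_net,
        "tw": tw_current,
    }
```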

Selection of Optimum AP (RUN-TIME)

Using the Current-SP and the MODEL, the Optimum-AP values are selected.
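With the Current-SP in hand, selecting the optimum AP is a small search over candidate (p, b) pairs that evaluates the model rather than executing the routine. A sketch; the inline cost model is the same illustrative stand-in as before, and the candidate sets are arbitrary:

```python
def select_optimum_ap(n, sp, p_values=(1, 2, 4, 8), b_values=(16, 32, 64, 128)):
    """Evaluate the cost model T = f(SP, AP, n) for every candidate
    (p, b) and return the cheapest pair.  `sp` holds the Current-SP
    values k3, k2, ts, tw; the model terms are illustrative."""
    def model(p, b):
        arithmetic = sp["k3"] * (2.0 * n**3 / 3.0) / p + sp["k2"] * n * n * b
        communication = (n / b) * (sp["ts"] + sp["tw"] * n * b)
        return arithmetic + communication
    return min(((p, b) for p in p_values for b in b_values),
               key=lambda ap: model(*ap))
```

Because the model is cheap to evaluate, an exhaustive sweep over the small AP space is affordable at every call.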

Execution of LAR (RUN-TIME)

The LAR is executed with the selected Optimum-AP values.

Platform load: different situations studied (8 nodes)

- Situation A: CPU availability 100% on all nodes; t_w-current = 0.7 μsec.
- Situation B: 80% on nodes 1-4 (t_w-current = 0.8 μsec); 100% on nodes 5-8 (0.7 μsec).
- Situation C: 60% on nodes 1-4 (1.8 μsec); 100% on nodes 5-8 (0.7 μsec).
- Situation D: 60% on nodes 1-4 (1.8 μsec); 100% on nodes 5-6 (0.7 μsec); 80% on nodes 7-8 (0.8 μsec).
- Situation E: 60% on nodes 1-4 (1.8 μsec); 100% on nodes 5-6 (0.7 μsec); 50% on nodes 7-8 (4.0 μsec).


Optimum AP for the different situations studied

[Table] Block size for each problem size n under platform-load situations A-E.
[Table] Number of nodes to use, p = r × c, for each problem size n under situations A-E: grids such as 2×4, 2×2, and 2×1, with fewer nodes used as the load grows.

Experimental Time: deviations from the Optimum

Conclusions and Future Work

- The proposed methodology is viable in systems where the load is stable or variable.
- Software such as NWS is suitable for adjusting, at run time, the system-parameter values obtained at installation time.
- The heterogeneous-load case offers many more possibilities than the one studied here.