CoMPI: Enhancing MPI based applications performance and scalability using run-time compression. Rosa Filgueira, David E. Singh, Alejandro Calderón and Jesús Carretero. University Carlos III of Madrid.

Summary
– Problem description
– Main objectives
– CoMPI
– Study of compression algorithms
– Evaluation of CoMPI
– Results
– Conclusions


Problem description
The cluster architecture is a common solution for scientific applications: a collection of computers working together, not always interconnected by a fast network.
Scientific applications need:
– A large number of compute nodes.
– Huge volumes of data transferred among the processes.
The communication system becomes a limiting factor for performance: a network with high latency and low bandwidth leads to network saturation.
The programming model used in clusters is MPI.

Main objectives (1/2)
Reduce the communication transfer time of MPI, improving both overall execution time and scalability.

Main objectives (2/2)
CoMPI: optimization of MPI communications by using compression.
– Compression in all MPI primitives.
– Fits any MPI application.
– Transparent to the user.
– Run-time compression.
Study of compression algorithms: selecting the best algorithm based on message characteristics.

Summary
– Problem description
– Main objectives
– CoMPI
  – How we have integrated compression into MPI
  – Set of compression algorithms proposed
– Study of compression algorithms
– Evaluation of CoMPI
– Results
– Conclusions

MPICH architecture (1/2)
MPI communication mechanisms:
– Point-to-point.
– Collective.
MPICH layers:
– Application Programmer Interface (API).
– Abstract Device Interface (ADI).
– Channel Interface (CI).
ADI layer:
– Controls the data.
– Specifies whether the message is sent or received.
– Message queue management.
– Message-passing protocols.
– Collective routines are implemented using point-to-point routines; point-to-point routines are provided by the ADI.
Modification of the ADI layer:
– Data compression and decompression.
– Integrated compression library.
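Because every primitive funnels through the ADI's point-to-point path (collectives are built on it), hooking compression there covers all of MPI transparently. A minimal sketch of such a hook; the names compi_compress and adi_contig_send are hypothetical stand-ins for the compression library and the real MPICH internals:

```c
#include <stdlib.h>
#include <string.h>

/* Placeholder compressor: the real hook would call the selected
 * compressor (e.g. LZO or FPC).  This stub just copies, i.e. it
 * reports no gain. */
static size_t compi_compress(const void *src, size_t len, void *dst)
{
    memcpy(dst, src, len);
    return len;
}

/* Placeholder for the unmodified ADI send path. */
static void adi_contig_send(const void *buf, size_t len, int dest)
{
    (void)buf; (void)len; (void)dest;   /* real code hands off to MPICH */
}

/* One hook at the ADI layer is enough to cover every MPI primitive
 * without touching the application. */
void adi_contig_send_compressed(const void *buf, size_t len, int dest)
{
    void  *tmp  = malloc(len);
    size_t clen = tmp ? compi_compress(buf, len, tmp) : len;

    if (tmp && clen < len)
        adi_contig_send(tmp, clen, dest);   /* shrank: ship compressed */
    else
        adi_contig_send(buf, len, dest);    /* no gain: send unchanged */
    free(tmp);
}
```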

MPICH architecture (2/2)

Compression of MPI Messages (1/2)

Compression of MPI Messages (2/2)
A header in the exchanged message informs the receiver:
– Whether compression was used, which algorithm, and the length.
All compression algorithms are included in a single compression library:
– CoMPI can be easily updated.
– New compression algorithms can be included.
Compression stages:
– Message size evaluation.
– Compression algorithm selection.
– Data compression.
– Header inclusion.
Decompression stages:
– Header checking.
– Data decompression.
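A plausible C shape for that header and the two pipelines; the field names and enum encoding are illustrative assumptions, not the actual CoMPI wire format:

```c
#include <stdint.h>

/* Compressors named in the deck; COMPI_NONE marks an uncompressed
 * payload.  Purely illustrative encoding. */
enum compi_algo { COMPI_NONE = 0, COMPI_LZO, COMPI_RICE, COMPI_FPC };

/* Header prepended to every exchanged message. */
typedef struct {
    uint8_t  compressed;   /* compression used or not            */
    uint8_t  algorithm;    /* which compressor (enum compi_algo) */
    uint32_t orig_len;     /* payload length before compression  */
} compi_header_t;

/* Sender stages:   evaluate message size -> select algorithm ->
 *                  compress data -> include header -> send.
 * Receiver stages: check header -> decompress if hdr.compressed. */
```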

Set of compression algorithms proposed (1/2)
The compressors selected for CoMPI are lossless and have the smallest overhead.

Set of compression algorithms proposed (2/2)

Summary
– Problem description
– Main objectives
– CoMPI
– Study of compression algorithms
  – Conclusion of the compression study
– Evaluation of CoMPI
– Results
– Conclusions

Study of compression algorithms (1/7)
Goal: select the most appropriate algorithm for each datatype, based on:
– Buffer size.
– Redundancy level.
Whether compression increases the transmission speed depends on:
– The number of bits sent.
– The time required to compress.
– The time required to decompress.
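That dependence reduces to a simple inequality: compression pays off only when compressing, sending the smaller buffer, and decompressing together beat sending the raw buffer. A back-of-envelope sketch with hypothetical units (seconds, bytes per second):

```c
#include <stdbool.h>

/* Compression increases effective transmission speed only when
 *   t_compress + compressed_bytes/bandwidth + t_decompress
 *     <  original_bytes/bandwidth.                             */
static bool compression_pays_off(double t_compress, double t_decompress,
                                 double original_bytes,
                                 double compressed_bytes,
                                 double bandwidth_bytes_per_s)
{
    return t_compress + compressed_bytes / bandwidth_bytes_per_s
                      + t_decompress
         < original_bytes / bandwidth_bytes_per_s;
}
```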

Study of compression algorithms (2/7)
Synthetic datasets: integer, floating-point and double precision.
Each dataset contains buffers with different:
– Buffer sizes: 100, 500, 900 and 1500 KB.
– Redundancy levels: 0%, 25%, 50%, 75% and 100%.
For each algorithm, datatype, buffer size and redundancy level we study the complexity and compression ratio.

Study of compression algorithms (3/7)

Study of compression algorithms (4/7): Integer dataset

Study of compression algorithms (5/7): Floating-point dataset

Study of compression algorithms (6/7): Double-precision dataset WITHOUT pattern

Study of compression algorithms (7/7): Double-precision dataset WITH pattern (data sequence: …)

Conclusion of the compression study
Integer and floating-point data:
– 0% redundancy: do not compress.
– 25% to 100% redundancy: LZO.
Double-precision data:
– Without a pattern: LZO.
– With a pattern, 0% to 50% redundancy: FPC.
– With a pattern, 50% to 100% redundancy: LZO.
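These rules fit naturally into a small selection routine. The helper below is a hypothetical encoding of the table above, not CoMPI's actual API (redundancy is given in percent):

```c
#include <stdbool.h>

enum compi_algo { COMPI_NONE, COMPI_LZO, COMPI_FPC };
enum compi_type { COMPI_INT, COMPI_FLOAT, COMPI_DOUBLE };

enum compi_algo select_algorithm(enum compi_type type,
                                 int redundancy_pct, bool has_pattern)
{
    if (type == COMPI_DOUBLE) {
        /* FPC wins on patterned doubles at low redundancy;
         * otherwise LZO. */
        return (has_pattern && redundancy_pct <= 50) ? COMPI_FPC
                                                     : COMPI_LZO;
    }
    /* Integer and floating-point: skip compression only when the
     * data is fully random (0% redundancy). */
    return redundancy_pct == 0 ? COMPI_NONE : COMPI_LZO;
}
```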

Summary
– Problem description
– Main objectives
– CoMPI
– Study of compression algorithms
– Evaluation of CoMPI
– Results
– Conclusions

Evaluation of CoMPI
– Distribution: MPICHGM, NOGM-COMP.
– Cluster: 64 nodes, dual-core AMD, 512 MB of RAM, Fast Ethernet network.
– Benchmarks: NAS Parallel IS (integer) and LU (double).
– Applications: BISP3D (float), PSRG (integer), STEM-II (float).

Summary
– Problem description
– Main objectives
– CoMPI
– Study of compression algorithms
– Evaluation of CoMPI
– Results
  – Real applications
  – Benchmarks
– Conclusions

Results (1/5)
BISP3D:
– Floating-point data.
– Improves between x1.2 and x1.4 with LZO.

Results (2/5)
PSRG:
– Integer data.
– Improves up to x2 with LZO.

Results (3/5)
STEM-II:
– Floating-point data.
– Improves up to x1.4 with LZO.

Results (4/5)
IS:
– Integer data.
– Improves up to x1.2 with LZO.
– Rice obtains good results with 32 processes.

Results (5/5)
LU:
– Double-precision data.
– No performance improvement; only with 64 processes, using FPC, do we obtain a speedup of x1.1.

Summary
– Problem description
– Main objectives
– CoMPI
– Study of compression algorithms
– Evaluation of CoMPI
– Results
– Conclusions
  – Principal conclusions
  – Ongoing work

Principal conclusions (1/2)
A new compression library integrated into MPI using the MPICH distribution: CoMPI.
CoMPI includes five different compression algorithms and compresses all MPI primitives.
Main characteristics:
– Transparent to the users.
– Fits any application without any change to it.
We have evaluated CoMPI using:
– Synthetic traces.
– Real applications.

Principal conclusions (2/2)
The evaluation results demonstrate that, in most cases, compression:
– Reduces the overall execution time.
– Enhances scalability.
When compression is not appropriate:
– Only a small performance degradation.

Ongoing work (1/2)
Adaptive compression:
– Select the most appropriate compression algorithm.
– Turn compression on/off at run time for the application.
Learning from the communication history, taking into account:
– Message characteristics: datatype and redundancy level.
– Platform: network latency and bandwidth.
– Compression algorithm behavior.
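One possible shape for that adaptive layer, sketched under the assumption that a smoothed speedup score is kept over the communication history; purely illustrative, since this part of the design is still in progress:

```c
#include <stdbool.h>

/* Smoothed speedup observed for one compressor on past messages. */
typedef struct {
    double ewma_speedup;   /* start at 1.0 (neutral) */
} compi_history_t;

/* After each message, fold the observed speedup (raw send time
 * divided by compressed send time) into the history and decide
 * whether to keep compression turned on. */
static bool compi_keep_compressing(compi_history_t *h,
                                   double observed_speedup)
{
    const double alpha = 0.25;   /* weight of the newest sample */
    h->ewma_speedup = alpha * observed_speedup
                    + (1.0 - alpha) * h->ewma_speedup;
    return h->ewma_speedup > 1.0;   /* turn off once it stops paying */
}
```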

Ongoing work (2/2)

Questions?