ProActive performance evaluation with NAS benchmarks and optimization of OO SPMD (Brian Amedro, Vladimir Bodnartchouk)

Presentation transcript:

1 ProActive performance evaluation with NAS benchmarks and optimization of OO SPMD. Brian Amedro, Vladimir Bodnartchouk

2 Outline: TimIt, a profiling tool for ProActive; the OO SPMD model in ProActive; performance evaluation with NAS benchmarks; optimizing group communications.

3 TimIt: a profiling tool for ProActive. A ProActive feature to time and analyze applications.
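
To make the idea concrete, here is a minimal plain-Java sketch of the kind of per-phase timing a profiling layer such as TimIt collects; the TimerCounter class and its methods are illustrative placeholders, not the actual TimIt API.

import java.util.LinkedHashMap;
import java.util.Map;

// Hand-rolled timer counter sketching per-phase timing (compute vs comms).
// Names and structure are hypothetical, not the ProActive TimIt API.
public class TimerCounter {
    private final Map<String, Long> totals = new LinkedHashMap<>();
    private final Map<String, Long> starts = new LinkedHashMap<>();

    public void start(String label) { starts.put(label, System.nanoTime()); }

    public void stop(String label) {
        long elapsed = System.nanoTime() - starts.get(label);
        totals.merge(label, elapsed, Long::sum);
    }

    public void report() {
        totals.forEach((label, nanos) ->
            System.out.printf("%-12s %8.3f ms%n", label, nanos / 1e6));
    }

    public static void main(String[] args) throws InterruptedException {
        TimerCounter tc = new TimerCounter();
        tc.start("compute");
        Thread.sleep(50);          // stand-in for a computation phase
        tc.stop("compute");
        tc.start("comms");
        Thread.sleep(20);          // stand-in for a communication phase
        tc.stop("comms");
        tc.report();               // prints the compute vs comms breakdown
    }
}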

4 OO SPMD model. A parallel programming model offering flexibility and a high level of abstraction; heavily used in the NAS benchmark implementations. [Figure: one-to-all, scattering, and reduce operations]
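
A conceptual sketch of the group-call idea behind OO SPMD, in plain Java: a single method call on a group proxy fans out to every member (one-to-all), and a scatter variant gives each member its own slice of the argument. The Worker and WorkerGroup types below are teaching constructs, not the ProActive typed-group API, and the calls are synchronous here whereas ProActive group calls are asynchronous on active objects.

import java.util.Arrays;
import java.util.List;

interface Worker {
    void compute(double[] chunk);
}

// A group proxy: calling compute(...) on the group calls it on every member.
class WorkerGroup implements Worker {
    private final List<Worker> members;
    WorkerGroup(List<Worker> members) { this.members = members; }

    // One-to-all: the same argument is broadcast to every member.
    @Override
    public void compute(double[] chunk) {
        for (Worker w : members) w.compute(chunk);
    }

    // Scattering: each member receives its own slice of the argument.
    public void computeScattered(double[][] chunks) {
        for (int i = 0; i < members.size(); i++) {
            members.get(i).compute(chunks[i]);
        }
    }
}

public class GroupCallSketch {
    public static void main(String[] args) {
        List<Worker> workers = Arrays.asList(
            chunk -> System.out.println("worker A got " + chunk.length + " values"),
            chunk -> System.out.println("worker B got " + chunk.length + " values"));
        WorkerGroup group = new WorkerGroup(workers);
        group.compute(new double[8]);                                            // one-to-all
        group.computeScattered(new double[][]{ new double[4], new double[4] });  // scattering
    }
}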

5 NAS Parallel Benchmarks. Designed by NASA to evaluate the performance of high-performance systems; strongly based on CFD; 5 benchmarks (kernels) testing different aspects of a system; easy to implement thanks to the OO SPMD pattern. Tests were performed with Sun Java 1.5 and RMI for ProActive, and with the PGI 6.0 compiler for MPI.

6 CG Kernel (Conjugate Gradient). Floating-point operations; eigenvalue computation; high number of unstructured communications. [Stats chart: calls, 570 MB sent, 1 min, % comms]

7 MG Kernel (Multi-Grid). Floating-point operations; solves a Poisson problem; structured communications. [Stats chart: 600 calls, 45 MB sent, 1 min, % comms]

8 IS Kernel (Integer Sort). Key-ranking operations; bucket sort; large arrays in memory. [Stats chart: 65 calls, 22 MB sent, 4 min, % comms]

9 EP Kernel (Embarrassingly Parallel). Random number generation; almost no communication. [Stats chart: 6 calls, 246 bytes sent, 7 min 32 s, 2% comms]

10 FT Kernel (Fourier Transform). Floating-point operations; big messages, 8 MB per call. [Stats chart: 22 calls, 180 MB sent, 1 min, % comms]

11 Optimizing group communications. Implement efficient group communication; minimize TCP traffic; decrease network congestion; use clustering techniques to choose the best algorithm for a given situation.
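
The next two slides present the two candidate all-to-all algorithms. As a simple stand-in for the clustering-based selection mentioned above, a message-size threshold selector could look like the sketch below; the 64 KB cut-off is an assumed value for illustration only.

// Hypothetical size-threshold selector between the two all-to-all algorithms
// described on the following slides. The 64 KB cut-off is an assumption; the
// presentation proposes clustering techniques rather than a fixed threshold.
enum AllToAllAlgorithm { RING, RECURSIVE_DOUBLING }

public class AlgorithmSelector {
    static final long SMALL_MESSAGE_BYTES = 64 * 1024;  // assumed threshold

    static AllToAllAlgorithm choose(long messageBytes) {
        // Recursive doubling (log n steps) suits small messages,
        // the ring algorithm (n-1 steps) suits large ones.
        return messageBytes < SMALL_MESSAGE_BYTES
                ? AllToAllAlgorithm.RECURSIVE_DOUBLING
                : AllToAllAlgorithm.RING;
    }

    public static void main(String[] args) {
        System.out.println(choose(2 * 1024));        // RECURSIVE_DOUBLING
        System.out.println(choose(8 * 1024 * 1024)); // RING
    }
}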

12 Ring all-to-all algorithm. Best for large messages; takes n-1 steps. [Diagram: steps 1, 2, 3]
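
A self-contained Java simulation of the ring all-to-all exchange: every process bundles its outgoing blocks and, at each of the n-1 steps, forwards the bundle to its right neighbor, which keeps the block addressed to itself and passes the rest on. This sketches the general ring pattern named on the slide, not the presenters' ProActive implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RingAllToAll {
    public static void main(String[] args) {
        int n = 4;
        // data[src][dst]: the value process src must deliver to process dst.
        int[][] data = new int[n][n];
        for (int src = 0; src < n; src++)
            for (int dst = 0; dst < n; dst++)
                data[src][dst] = 100 * src + dst;

        int[][] received = new int[n][n];                          // received[dst][src]
        for (int r = 0; r < n; r++) received[r][r] = data[r][r];   // local block stays put

        // bundle[r] = blocks currently held by process r that still have to travel.
        // Each block is {originalSource, destination, value}.
        List<List<int[]>> bundle = new ArrayList<>();
        for (int src = 0; src < n; src++) {
            List<int[]> b = new ArrayList<>();
            for (int dst = 0; dst < n; dst++)
                if (dst != src) b.add(new int[]{src, dst, data[src][dst]});
            bundle.add(b);
        }

        // n-1 steps: everyone forwards its bundle to the right neighbor; the
        // receiver keeps the block addressed to itself and forwards the rest.
        for (int step = 1; step < n; step++) {
            List<List<int[]>> next = new ArrayList<>();
            for (int r = 0; r < n; r++) next.add(new ArrayList<>());
            for (int sender = 0; sender < n; sender++) {
                int receiver = (sender + 1) % n;
                for (int[] block : bundle.get(sender)) {
                    if (block[1] == receiver) received[receiver][block[0]] = block[2];
                    else next.get(receiver).add(block);   // keep forwarding
                }
            }
            bundle = next;
        }

        for (int r = 0; r < n; r++)
            System.out.println("process " + r + " received " + Arrays.toString(received[r]));
    }
}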

13 Recursive doubling all-to-all algorithm. Best for small messages; takes log(n) steps. [Diagram: steps 1, 2]
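
A sketch of the recursive-doubling pattern in Java: at step k each process exchanges everything it currently holds with the partner whose rank differs in bit k (partner = rank XOR 2^k), so after log2(n) steps every process holds every block. For simplicity this simulates an all-to-all broadcast and assumes the number of processes is a power of two; it illustrates the exchange pattern, not the presenters' implementation.

import java.util.Arrays;

public class RecursiveDoubling {
    public static void main(String[] args) {
        int n = 8;                          // must be a power of two for this sketch
        // have[r][s] == true once process r holds the block originating at s.
        boolean[][] have = new boolean[n][n];
        for (int r = 0; r < n; r++) have[r][r] = true;

        // log2(n) steps: at step k every process exchanges all blocks it holds
        // with the partner whose rank differs in bit k (partner = rank XOR 2^k).
        for (int k = 0; (1 << k) < n; k++) {
            boolean[][] next = new boolean[n][n];
            for (int r = 0; r < n; r++) next[r] = Arrays.copyOf(have[r], n);
            for (int r = 0; r < n; r++) {
                int partner = r ^ (1 << k);
                for (int s = 0; s < n; s++)
                    if (have[partner][s]) next[r][s] = true;   // receive partner's blocks
            }
            have = next;
            System.out.println("after step " + (k + 1) + ": process 0 holds "
                + Arrays.toString(have[0]));
        }
        // After log2(n) steps every process holds all n blocks.
    }
}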

14 Conclusion. TimIt is an easy and helpful profiling tool; the NAS benchmarks are easy to implement with ProActive and the OO SPMD pattern; good performance is expected with the upcoming Sun Java 6 and the use of Ibis RMI.

15 Questions?

16 MPI / ProActive mapping

MPI                      ProActive
mpirun                   deployment
MPI_Init                 activities creation
MPI_Finalize
MPI_Comm_size            getMyGroupSize
MPI_Comm_rank            getMyRank
MPI_*Send / MPI_*Recv    method call (setter and getter)
MPI_Barrier              barrier
MPI_Bcast                method call
MPI_Scatter              method call with a scatter group as parameter
MPI_Gather               result of a group communication
MPI_Reduce               programmer's method

Back
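
To show how the mapping reads in code, here is a small plain-Java simulation of an SPMD worker: getMyRank(), getMyGroupSize() and barrier() mirror the names in the table above, while the SpmdContext class and the thread-based "deployment" are local stand-ins rather than the ProActive runtime.

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class MappingSketch {

    /** Minimal stand-in for the SPMD runtime: rank, group size, barrier. */
    static class SpmdContext {
        private final int rank, size;
        private final CyclicBarrier barrier;
        SpmdContext(int rank, int size, CyclicBarrier barrier) {
            this.rank = rank; this.size = size; this.barrier = barrier;
        }
        int getMyRank()      { return rank; }    // ~ MPI_Comm_rank
        int getMyGroupSize() { return size; }    // ~ MPI_Comm_size
        void barrier() {                         // ~ MPI_Barrier
            try { barrier.await(); } catch (InterruptedException | BrokenBarrierException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) {
        int n = 4;                                   // "deployment": a group of n workers
        CyclicBarrier barrier = new CyclicBarrier(n);
        for (int r = 0; r < n; r++) {
            SpmdContext ctx = new SpmdContext(r, n, barrier);
            new Thread(() -> {                       // "activities creation" ~ MPI_Init
                System.out.printf("worker %d of %d computing%n",
                        ctx.getMyRank(), ctx.getMyGroupSize());
                ctx.barrier();                       // synchronize before the next phase
                System.out.printf("worker %d past barrier%n", ctx.getMyRank());
            }).start();
        }
    }
}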