Parallel Program Performance Evaluation and Their Behavior Analysis on an OpenMP Cluster
Fei Cai, Shaogang Wu, Longbing Zhang, and Zhimin Tang, Institute of Computing Technology, Chinese Academy of Sciences
Motivation of This Study
Distributed-memory architectures built from clusters of PCs or workstations with commodity networking have become an increasingly attractive alternative for high-end parallel computing. The OpenMP API is an emerging standard for parallel programming on shared-memory multiprocessors. There has been some research on the OpenMP/DSM approach. Evaluating the performance of OpenMP programs and analyzing their behavior will help us support OpenMP on clusters efficiently with a DSM approach.
Background
JIAJIA: a home-based software DSM that supports scope consistency.
Our OpenMP compiler: a source-to-source compiler built on the SUIF architecture.
The Architecture of Our OpenMP Compiler
An OpenMP program → C preprocessor → OMP2JIA translator → S2C → a JIAJIA program → gcc (linked with the JIAJIA runtime library) → binary code, which runs on the cluster.
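To make the pipeline concrete, here is a minimal sketch of the kind of code such a translation could emit for a simple parallel loop. It assumes JIAJIA's published interface (jia_alloc, jia_barrier, and the process variables jiapid and jiahosts); the header name jia.h and all other names are illustrative, not actual OMP2JIA output.

    #include <jia.h>            /* JIAJIA runtime interface (assumed header name) */

    #define N 1024
    float *a, *b, *c;           /* allocated in DSM shared space via jia_alloc */

    /* Original OpenMP loop:
     *   #pragma omp parallel for
     *   for (i = 0; i < N; i++) a[i] = b[i] + c[i];
     */
    void translated_loop(void)
    {
        /* Block-distribute the N iterations over the jiahosts processes. */
        int chunk = (N + jiahosts - 1) / jiahosts;
        int lo = jiapid * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;

        for (int i = lo; i < hi; i++)
            a[i] = b[i] + c[i];

        jia_barrier();          /* the implicit barrier closing an omp for */
    }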
The Experimental Approach We Used
We compared the performance of two versions of six benchmarks to investigate how the characteristics of the DSM and of the translation, together with the original program behavior, determine the performance of OpenMP programs on our cluster.
Classification of Data Sharing and Communication Patterns
Our classification is based on the behavior of the iterations of parallel loops. In an iteration of a parallel loop, a processor may compute different kinds of objects; an object can be an item of a vector, a line of a 2-D array, or a plane of a cube, depending on the problem the application solves. The object computed by a processor during one iteration of a parallel loop is the smallest granularity of parallelism and data layout in OpenMP programs, and it is the basic unit distributed among processors for parallelism.
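As a concrete example (a hedged sketch, not taken from the benchmarks), in the loop below the object computed by each iteration is one row of the 2-D array u; the row is the unit of parallelism and of data layout:

    #define N 512
    double u[N][N], f[N][N];

    void relax(void)
    {
        /* One iteration computes one object: row i of u. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                u[i][j] = 0.25 * f[i][j];
    }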
Classification (contd.)
When a processor computes the object assigned to one of its iterations, the source of the data it accesses can be categorized into five classes.
Classification (contd.)
In the first case, all the data accessed by the processor belong to the object itself, the one the iteration computes. In the second case, the processor also accesses data belonging to the objects of the nearest (neighboring) iterations.
Classification (contd.)
In the third case, the processor accesses data belonging to all of the objects. In the fourth case, the processor accesses data belonging to the object of one special iteration. In the last case, the processor uses an index array to direct its accesses to the objects.
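The following sketch illustrates the five cases on a 2-D array whose objects are rows; it is an illustration of the classification, not code from the benchmarks, and all names are made up:

    #define N 512
    double u[N][N], v[N][N], colsum[N];
    int    idx[N];                  /* index array, filled elsewhere */
    int    special = 0;             /* a distinguished iteration     */

    void five_cases(void)
    {
        /* Case 1: all data accessed belong to the object itself. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                v[i][j] = 2.0 * u[i][j];

        /* Case 2: data also come from the objects of the nearest
         * iterations (rows i-1 and i+1). */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 0; j < N; j++)
                v[i][j] = u[i-1][j] + u[i][j] + u[i+1][j];

        /* Case 3: data come from all of the objects (every row is read). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            colsum[i] = 0.0;
            for (int j = 0; j < N; j++)
                colsum[i] += u[j][i];
        }

        /* Case 4: data come from the object of one special iteration. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                v[i][j] = u[i][j] - u[special][j];

        /* Case 5: an index array directs which objects are read. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                v[i][j] = u[idx[i]][j];
    }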
Benchmark Introduction
The applications we used cover a range of scientific and engineering programs. Three of them, CG, MG, and EP, are from the NAS Parallel Benchmarks. The other three, ART, SWIM, and EQUAKE, are from SPEC OMPM2001.
CG
CG uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. The first and the last sharing patterns are found in this benchmark. The testing size we employed is Class B.
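CG's last (index-array) pattern comes from its sparse matrix-vector product. A sketch of a generic CRS-format product (not the actual NAS CG source): the colidx array drives indirect reads of p, while each q[i] itself is purely local (the first pattern).

    /* Sparse matrix-vector product, CRS storage (schematic). */
    void spmv(int n, const int *rowptr, const int *colidx,
              const double *a, const double *p, double *q)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {          /* object: element q[i] */
            double s = 0.0;
            for (int k = rowptr[i]; k < rowptr[i+1]; k++)
                s += a[k] * p[colidx[k]];      /* indexed access (case 5) */
            q[i] = s;
        }
    }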
MG
MG is a simplified multigrid kernel that solves a 3-D Poisson PDE. The first and the second sharing patterns are found in this benchmark. The testing size we employed is Class A.
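MG's second pattern comes from stencil updates. A simplified relaxation step (schematic, not the actual NAS MG smoother): each iteration of the parallel loop computes one plane, and reading planes i-1 and i+1 is the nearest-neighbor pattern.

    #define N 64
    double u[N][N][N], r[N][N][N];

    void smooth(void)
    {
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)        /* object: plane i */
            for (int j = 1; j < N - 1; j++)
                for (int k = 1; k < N - 1; k++)
                    u[i][j][k] = (r[i-1][j][k] + r[i+1][j][k] +
                                  r[i][j-1][k] + r[i][j+1][k] +
                                  r[i][j][k-1] + r[i][j][k+1]) / 6.0;
    }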
EP
EP is an "embarrassingly parallel" problem. In this benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudorandom numbers, generated according to a scheme well suited to parallel computation. This problem is typical of many Monte Carlo applications. Only the first sharing pattern is found in this benchmark. The testing size we employed is Class A.
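EP's structure, schematically: every iteration works only on data it generates itself, so all accesses fall in the first sharing pattern, and the statistics are combined with an OpenMP reduction. gaussian_pair() is a hypothetical stand-in for the benchmark's generator, not the NAS code.

    /* Hypothetical stand-in for EP's Gaussian pair generator. */
    void gaussian_pair(long k, double *gx, double *gy);

    void ep_sums(long m, double *sx, double *sy)
    {
        double x = 0.0, y = 0.0;
        #pragma omp parallel for reduction(+:x,y)
        for (long k = 0; k < m; k++) {
            double gx, gy;
            gaussian_pair(k, &gx, &gy);   /* purely iteration-local work */
            x += gx;
            y += gy;
        }
        *sx = x;
        *sy = y;
    }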
ART
ART uses the Adaptive Resonance Theory 2 (ART 2) neural network to recognize objects in a thermal image. Only the first sharing pattern is found in this benchmark. The testing size we employed is REF.
SWIM
SWIM comes from the weather prediction field. It solves the shallow water equations on a 2-D array. The first and the second sharing patterns are found in this benchmark. The testing size we employed is TRAIN.
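SWIM's second pattern is the 2-D analogue of MG's: finite differences on a grid. A schematic update (not the SPEC source), where each iteration computes one row and reads its neighboring rows:

    #define M 512
    double unew[M][M], uold[M][M];

    void step(void)
    {
        #pragma omp parallel for
        for (int i = 1; i < M - 1; i++)        /* object: row i */
            for (int j = 1; j < M - 1; j++)
                unew[i][j] = uold[i][j] +
                    0.25 * (uold[i-1][j] + uold[i+1][j] +
                            uold[i][j-1] + uold[i][j+1] - 4.0 * uold[i][j]);
    }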
EQUAKE
EQUAKE simulates the propagation of elastic waves in large, highly heterogeneous valleys, such as California's San Fernando Valley or the Greater Los Angeles Basin. The first, the third, and the fifth sharing patterns are found in this benchmark. The testing size we employed is REF. For convenience of performance measurement, we reduced the number of iterations from 3334 to 120.
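EQUAKE's fifth pattern arises from its unstructured mesh: element loops reach nodal data through a connectivity array. A schematic sketch (not the SPEC source; all names are illustrative); it is kept serial here because concurrent updates to force[] would race without synchronization.

    /* Element-to-node scatter: node[][] indexes the objects being
     * updated (the index-array pattern). */
    void scatter_forces(int nelem, const int (*node)[4],
                        const double *eforce, double *force)
    {
        for (int e = 0; e < nelem; e++)        /* object: element e */
            for (int c = 0; c < 4; c++)
                force[node[e][c]] += eforce[e];
    }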
Main Difference Between Two Versions
Compiler-translated version: the pages of a shared array are distributed among processors in a block scheme, and the computation is distributed among processors evenly.
Hand-translated version: the pages of a shared array are distributed to the processors that produce them, and the computation is distributed among processors according to the data layout.
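A hedged sketch of the two scheduling schemes, reusing JIAJIA's jiapid/jiahosts variables from the pipeline sketch above; f() and owner_of() are hypothetical helpers, not part of JIAJIA or our compiler.

    double f(int i);          /* computes element i (hypothetical)          */
    int    owner_of(int i);   /* maps element i to its home process (hyp.)  */

    void compiler_translated(double *a, int n)
    {
        /* Block scheme: iterations split evenly, regardless of where
         * the pages of a[] live. */
        int chunk = (n + jiahosts - 1) / jiahosts;
        int lo = jiapid * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        for (int i = lo; i < hi; i++)
            a[i] = f(i);
    }

    void hand_translated(double *a, int n)
    {
        /* Owner-computes: each process updates exactly the elements
         * whose pages it produced, so computation follows data layout. */
        for (int i = 0; i < n; i++)
            if (owner_of(i) == jiapid)
                a[i] = f(i);
    }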
Behavior Changes Caused by Translation
Program    Behavior change introduced     Performance ratio to
           by translation                 hand-translated version
CG         Computation distribution        85.5%
MG         Data layout                     65.6%
EP         No                              99.1%
ART        No                              96.9%
SWIM       No                              80.4%
EQUAKE     No                             139.6%
Conclusion
Using a block scheme for both data layout and computation distribution performs quite well for all the applications we tested. The mismatch between data layout and computation distribution causes most of the overhead introduced by translation, and it is the main source of slowdown relative to the hand-translated versions. The translated program's performance is ultimately limited by the program's basic behavior on the DSM.
Thanks!!!