
Parallel Program Performance Evaluation and Their Behavior Analysis on an OpenMP Cluster
Fei Cai, Shaogang Wu, Longbing Zhang, and Zhimin Tang
Institute of Computing Technology, Chinese Academy of Sciences

Motivation of This Study
Distributed-memory architectures built from clusters of PCs or workstations with commodity networking have become an increasingly attractive alternative for high-end parallel computing. The OpenMP API is an emerging standard for parallel programming on shared-memory multiprocessors, and there is existing research on the OpenMP/DSM approach. Evaluating the performance of OpenMP programs and analyzing their behavior will help us support OpenMP efficiently on clusters with a DSM approach.

Background
JIAJIA: a home-based software DSM that supports scope consistency. Our OpenMP compiler: a source-to-source compiler built on the SUIF infrastructure.

The Architecture of Our OpenMP Compiler
OpenMP program → C preprocessor → OMP2JIA translator → S2C → JIAJIA program → gcc → binary code, linked against the JIAJIA runtime library and run on the cluster.
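To make the translation concrete, here is a minimal sketch of the SPMD form such a translator might emit for one simple parallel loop. This is an illustration, not actual OMP2JIA output; the jia_init/jia_alloc/jia_barrier/jia_exit calls and the jiapid/jiahosts globals follow JIAJIA's published API, but the exact prototypes and header name are assumptions.

/* Sketch: one OpenMP parallel loop lowered onto JIAJIA (illustrative). */
#include <stdlib.h>
/* #include <jia.h>  -- JIAJIA header; exact name and prototypes assumed */
extern void  jia_init(int argc, char **argv);
extern char *jia_alloc(int size);
extern void  jia_barrier(void);
extern void  jia_exit(void);
extern int   jiapid, jiahosts;   /* JIAJIA: process id and process count */

#define N 1024

int main(int argc, char **argv)
{
    double *a, *b, *c;
    int i, chunk, lo, hi;

    jia_init(argc, argv);                          /* start the software DSM */
    a = (double *) jia_alloc(N * sizeof(double));  /* shared pages */
    b = (double *) jia_alloc(N * sizeof(double));
    c = (double *) jia_alloc(N * sizeof(double));
    /* ... initialization of b and c omitted ... */

    /* Was: #pragma omp parallel for
       Block distribution of iterations among the DSM processes. */
    chunk = (N + jiahosts - 1) / jiahosts;
    lo = jiapid * chunk;
    hi = (lo + chunk < N) ? lo + chunk : N;
    for (i = lo; i < hi; i++)
        a[i] = b[i] + c[i];

    jia_barrier();                                 /* the loop's implicit barrier */
    jia_exit();
    return 0;
}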

The Experimental Approach We Used
We compared the performance of two versions of six benchmarks to investigate how the characteristics of the DSM and the translation, together with the original program behavior, determine the performance of OpenMP programs on our cluster.

Classification of Data Sharing and Communication Patterns
Our classification is based on the behavior of the iterations of parallel loops. In an iteration of a parallel loop, a processor may compute different kinds of objects: an item of a vector, a line of a 2-D array, a plane of a cube, etc., according to the problem the application solves. The object computed by a processor during an iteration of a parallel loop is the smallest granularity of parallelism and data layout in OpenMP programs, and it is the basic unit distributed among processors for parallelism.

Classification (contd.)
When a processor computes the object assigned to an iteration, the sources of the data it accesses fall into five classes.

Classification (contd.)
In the first case, all the data accessed by the processor belong to the object itself, i.e., the object the iteration computes. In the second case, the processor also accesses data belonging to the objects of the nearest iterations.

Classification (contd.)
In the third case, the processor accesses data belonging to all of the objects. In the fourth case, the processor accesses data belonging to the object of one particular iteration. In the last case, the processor uses an index array to determine which objects it accesses. The sketch below illustrates the five classes.
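The following loop fragments (hypothetical; the array and variable names are not taken from the benchmarks) illustrate the five classes. Each iteration i computes object i of the output array:

/* Illustrative loops for the five data-source classes. */
#include <stddef.h>

void five_patterns(size_t n, size_t k, double *a, double *b, double *x,
                   double *y, double **m, const size_t *idx)
{
    size_t i, j;

    /* 1. Only the object itself (seen in EP and ART). */
    for (i = 0; i < n; i++)
        a[i] = 2.0 * a[i];

    /* 2. Objects of the nearest iterations: a stencil (MG, SWIM). */
    for (i = 1; i + 1 < n; i++)
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);

    /* 3. All objects: every iteration reads the whole vector x (EQUAKE). */
    for (i = 0; i < n; i++) {
        y[i] = 0.0;
        for (j = 0; j < n; j++)
            y[i] += m[i][j] * x[j];
    }

    /* 4. The object of one particular iteration k. */
    for (i = 0; i < n; i++)
        b[i] = a[i] + a[k];

    /* 5. Objects selected through an index array (CG, EQUAKE). */
    for (i = 0; i < n; i++)
        b[i] = a[idx[i]];
}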

Benchmark Introduction
The applications we used span a range of scientific and engineering programs. Three of them (CG, MG, and EP) are from the NAS Parallel Benchmarks; the other three, ART, SWIM, and EQUAKE, are from SPEC OMPM2001.

CG
CG uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive-definite matrix. The first and the last sharing patterns are found in this benchmark. The problem size we used is Class B.
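CG's index-array pattern arises in its sparse matrix-vector product. Here is a generic CSR (compressed sparse row) SpMV sketch, not the actual NAS CG source:

/* Generic CSR sparse matrix-vector multiply, y = A*x. The column-index
   array colidx drives indirect reads of x: the fifth (index-array)
   sharing pattern. Illustrative, not the NAS CG code. */
#include <stddef.h>

void spmv_csr(size_t nrows, const size_t *rowptr, const size_t *colidx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colidx[k]];
        y[i] = sum;
    }
}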

MG
MG is a simplified multigrid kernel that solves a 3-D Poisson PDE. The first and the second sharing patterns are found in this benchmark. The problem size we used is Class A.

EP
EP is an "embarrassingly parallel" problem. In this benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudorandom numbers, which are generated according to a scheme well suited to parallel computation. This problem is typical of many Monte Carlo applications. Only the first sharing pattern is found in this benchmark. The problem size we used is Class A.
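EP's shape, in which each iteration touches only its own data and results are combined by a reduction, can be sketched as below. This illustration uses a plain Box-Muller step with rand_r rather than NAS's specific linear-congruential scheme:

/* Embarrassingly parallel accumulation in the spirit of EP (illustrative). */
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

void ep_sketch(long n, unsigned int seed, double *sum_x, double *sum_y)
{
    double sx = 0.0, sy = 0.0;
    long i;

    #pragma omp parallel for reduction(+ : sx, sy)
    for (i = 0; i < n; i++) {
        unsigned int s = seed + (unsigned int)i;  /* private stream per iter */
        double u1 = (rand_r(&s) + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = rand_r(&s) / ((double)RAND_MAX + 1.0);
        double r = sqrt(-2.0 * log(u1));          /* Box-Muller transform */
        sx += r * cos(2.0 * M_PI * u2);           /* Gaussian pair */
        sy += r * sin(2.0 * M_PI * u2);
    }
    *sum_x = sx;
    *sum_y = sy;
}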

ART
ART uses the Adaptive Resonance Theory 2 (ART 2) neural network to recognize objects in a thermal image. Only the first sharing pattern is found in this benchmark. The input size we used is REF.

SWIM
SWIM comes from the weather prediction field. It solves the shallow water equations on a 2-D grid. The first and the second sharing patterns are found in this benchmark. The input size we used is TRAIN.

EQUAKE
EQUAKE simulates the propagation of elastic waves in large, highly heterogeneous valleys, such as California's San Fernando Valley or the Greater Los Angeles Basin. The first, the third, and the fifth sharing patterns are found in this benchmark. The input size we used is REF. For convenience of performance measurement, we reduced the number of iterations from 3334 to 120.

Main Difference Between the Two Versions
Compiler-translated version: the pages of a shared array are distributed among processors in a block scheme, and the computation is distributed evenly among processors. Hand-translated version: the pages of a shared array are distributed to the processors that produce them, and the computation is distributed among processors according to the data layout. The sketch below contrasts the two schemes.
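The difference can be sketched as two iteration-assignment schemes (hypothetical code; jiapid and jiahosts are JIAJIA's process-id and process-count globals, and page_home() is a made-up helper returning the home process of a page):

/* Two ways to assign loop iterations to DSM processes. */
extern int jiapid, jiahosts;            /* JIAJIA globals */
extern int page_home(const void *addr); /* hypothetical helper */

/* Compiler-translated: even block split of iterations, independent of
   where the shared pages actually live. */
void block_version(long n, double *a, const double *b, const double *c)
{
    long chunk = (n + jiahosts - 1) / jiahosts;
    long lo = (long)jiapid * chunk;
    long hi = (lo + chunk < n) ? lo + chunk : n;
    for (long i = lo; i < hi; i++)
        a[i] = b[i] + c[i];
}

/* Hand-translated: owner-computes; each process updates only elements
   whose home pages it holds, so computation follows the data layout. */
void owner_version(long n, double *a, const double *b, const double *c)
{
    for (long i = 0; i < n; i++)
        if (page_home(&a[i]) == jiapid)
            a[i] = b[i] + c[i];
}

When the two schemes disagree, a process repeatedly writes pages homed on other processes, and the resulting DSM traffic is the stagger overhead discussed in the conclusion.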

Behavior Changes Caused by Translation

Program   Behavior change introduced by translation   Performance ratio to hand-translated version
CG        Computation distribution                    85.5%
MG        Data layout                                 65.6%
EP        None                                        99.1%
ART       None                                        96.9%
SWIM      None                                        80.4%
EQUAKE    None                                        139.6%

Conclusion
Using a block scheme for both data layout and computation distribution performs quite well for all the applications we tested. The stagger between data layout and computation distribution causes most of the overheads introduced by translation; it is the main source of slowdown relative to the hand-translated versions. Beyond that, the translated program's performance is limited by the program's basic behavior on the DSM.

Thanks!!!