
A COMPARISON MPI vs POSIX Threads

Overview
- MPI allows you to run multiple processes on one host.
  - How would running MPI on one host compare with a POSIX threads solution?
- Attempting to compare MPI vs POSIX run times.
- Hardware
  - Dual 6-core (2 threads per core), 12 logical
  - Intel Xeon CPU E5-2667 (show schematic)
  - 2.96 GHz
  - 15 MB L3 cache
- All code / output / analysis available here:

Specifics
- Going to compare run times of code written in MPI vs code written using POSIX threads and shared memory.
  - Try to make the code as similar as possible so we're comparing apples with oranges and not apples with monkeys.
  - Since we are on one machine, the bus carries all the communication traffic, which should make the POSIX and MPI versions similar (i.e. the network doesn't get involved).
- Only makes sense with one machine.
- Set up test bed.
  - Try each step individually, check results, then automate.
- Use the matrix-matrix multiply code we developed over the semester.
  - Everyone is familiar with the code and can make observations.
  - Use square matrices.
- Vary matrix sizes from 500 -> 10,000 elements square (plus a couple of big ones).
- Matrix A will be filled with 1-n, left to right and top down.
- Matrix B will be the identity matrix.
  - Can then check our results easily, as A*B = A when B is the identity matrix.
- Ran every step (compile, output results, parsing) many times and checked the results before writing the final scripts to do the processing.

Matrix Sizes
[Table: columns are matrix size, number of elements, and loop calculations (N multiplies, N-1 adds); the largest entry is on the order of E+12]
Third column: just the number of calculations inside the loop for calculating the matrix elements.

Specifics cont.
- About the runs
  - For each matrix size (500 -> 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)
  - Vary thread count 2-12 (POSIX)
  - Vary processes 2-12 (MPI)
  - Run 10 trials of each and take the average (machine mostly idle when not running tests, but want to smooth spikes in run times caused by the system doing routine tasks)
- Make observations about anomalies in the run times where appropriate.
- Caveats
  - All initial runs with no optimization for testing, but hey, this is a class about performance.
  - Second set of runs with optimization turned on: -O1 (note: -O2 and -O3 made no appreciable difference).
  - First-level optimization made a huge difference: > 3x improvement.
  - GNU optimization explanation can be found here:
  - Built with just the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated).
  - Not all optimizations are flag controlled.
  - Regardless of whether the code is written in the most efficient fashion (and it's not), because of the similarity we can make some runs and observations.
- Oh No moment **
  - Huge improvement in performance with optimized code. Why?
  - What if the improvement in performance (from compiler optimization) was due to the identity matrix?
  - Maybe the compiler found a clever way to increase the speed because of the simple math and it's not really doing all the calculations I thought it was?
  - Came back and made matrix B non-identity: same performance. Whew.
  - I now believe the main performance improvement came from loop unrolling.
- Ready to make the runs.

Discussion
- Please chime in as questions come up.
- Process explanation (after initial testing and verification):
  - Attempted a 25,000 x 25,000 matrix.
    - Compiler error for MPI (exceeded MPI_Bcast 2 GB limit on matrices).
    - Not an issue for POSIX threads (until you run out of memory on the machine and start to swap).
- Settled on 12 processes / threads because of the number of cores available.
  - Do you get enhanced or degraded performance by exceeding that number?
- Example of process space / top output (10,000 x 10,000)
  - Early testing, before runs started. Pre-optimization.

Time Comparison (Boring)

Time Comparison (still boring…)
- In all these cases, the times for 5, 4, 3, and 2 processes are much longer than for 6, so they are left off for comparison.
- MPI doesn't "catch" back up till 11 processes.
- POSIX doesn't "catch" back up till 9 processes.

MPI Time Curve

POSIX Time Curve

POSIX Threads Vs MPI Processes Run Times Matrix Sizes 4000x4000 – 10,000 x 10,000

POSIX Threads 1500 x 1500 – 2500x2500

1600 x 1600 case
- Straight C runs long enough to see top output (here I can see the memory usage).
  - Threaded, MPI, and non-MP code share the same basic structure for calculating the "C" matrix.
- Suspect some kind of boundary issue here, possibly "false sharing"?
- Process fits entirely in shared L3 cache: 15 MB x 2 = 30 MB.
- Do the same number of calculations but make the initial array allocations larger (shown below).

~/SUNY]$ foreach NUM_TRIALS ( )
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
~/SUNY]$ foreach NUM_TRIALS ( )
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
~/SUNY]$

Future Directions
- POSIX threads with network memory? (NFS)
- Combo MPI and POSIX threads?
  - MPI to multiple machines, then POSIX threads?
  - POSIX threads that launch MPI?
- Couldn't get MPE running with MPICH (would like to re-investigate why).
- Investigate optimization techniques.
  - Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
  - Rerun with non-identity B matrix and compare times <- DONE
- Try different languages, e.g. Chapel.
- Try different algorithms.
- Want to add OpenMP to the mix.
  - Found this paper on OpenMP vs direct POSIX programming (similar tests)
- For < 6 processes, look at thread_affinity and assignment of threads to a physical processor.