

A COMPARISON MPI vs POSIX Threads

Overview
- MPI allows you to run multiple processes on 1 host
  - How would running MPI on 1 host compare with a similar POSIX thread solution?
- Attempting to compare MPI vs POSIX run times
- Hardware
  - Dual 6 core (2 threads per core), 12 logical
  - Intel Xeon CPU E5-2667 (show schematic)
  - 2.96 GHz
  - 15 MB L3 cache, shared (2.5 MB per core)
- All code / output / analysis available here:

About the Time Trials
- Going to compare run times of code in MPI vs code written using POSIX threads and shared memory
  - Try to make the code as similar as possible so we’re comparing apples with oranges and not apples with monkeys
  - Since we are on 1 machine the bus is doing all the communication traffic, which should make the POSIX and MPI versions similar (i.e. network latency isn’t the weak link). So this analysis only makes sense on 1 machine
- Use the matrix-matrix multiply code we developed over the semester
  - Everyone is familiar with the code and can make observations
  - Use square matrices
    - Not necessary, but it made things more convenient
- Vary matrix sizes from 500 -> 10,000 elements square (plus a couple of bigger ones)
- Matrix A will be filled with 1-n, left to right and top down
- Matrix B will be the identity matrix
  - Can then check our results easily, as A*B = A when B = identity matrix
- Ran all processes (i.e. compile / output result / parsing) many times and checked before writing final scripts to do the processing
- Set up test bed
  - Try each step individually, check results, then automate

Specifics cont.
- About the runs
  - For each matrix size (500 -> 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)
  - Vary thread count 2-12 (POSIX)
  - Vary processes 2-12 (MPI)
  - Run 10 trials of each and take the average (machine mostly idle when not running tests, but want to smooth spikes in run times caused by the system doing routine tasks)
  - With later runs I ran 12, dropped the high and low, then took the average
- Try to make observations about anomalies in the run times where appropriate
- Caveats
  - All initial runs with no optimization for testing, but hey, this is a class about performance
  - Second set of runs with optimization turned on: -O1 (note: -O2 & -O3 made no appreciable difference)
  - First level optimization made a huge difference: > 3x improvement
  - GNU optimization explanation can be found here:
  - Built with just the flags that -O1 turns on to see if I could catch the “one” making the most difference (nope) (code isn’t that complicated)
  - Not all optimizations are flag controlled
  - Regardless of whether the code is written in the most efficient fashion (and it’s not), because of the similarity we can make some runs and observations
- Oh No moment **
  - Huge improvement in performance with optimized code, why?
  - I now believe the main performance improvement came from loop unrolling.
  - Maybe the compiler found a clever way to increase the speed because of the simple math and it’s not really doing all the calculations I thought it was?
  - Came back and made matrix B non-identity, same performance. Whew.
  - OK - ready to make the runs

Discussion
- Please chime in as questions come up.
- Process explanation: (after initial testing and verification)
  - top -d .1 (tap 1 to show CPU list, tap H to show threads)
- Attempted a 25,000 x 25,000 matrix
  - Compiler error for MPI (exceeded MPI_Bcast 2 GB limit on matrices)
  - Not an issue for POSIX threads (until you run out of memory on the machine and swap)
- Settled on 12 processes / threads because of the number of cores available
  - Do you get enhanced or degraded performance by exceeding that number?
- Example of process space / top output (10,000 x 10,000)
  - Early testing, before runs started. Pre-optimization
  - Use >> top -d t (t in floating point secs; Linux), hit the “1” key to see the list of cores
- Take a look at some numbers

Time Comparison

Time Comparison
- In all these cases the time for 5, 4, 3, 2 processes was much longer than for 6, so they are left off for comparison
- MPI doesn’t “catch” back up till 11 processes
- POSIX doesn’t “catch” back up till 9 processes

MPI Time Curve

POSIX Time Curve

POSIX Threads Vs MPI Processes Run Times Matrix Sizes 4000x4000 – 10,000 x 10,000

POSIX Threads 1500 x 1500 – 2500x2500

MPI 1500 x 1500 – 1800 x 1800
- Notice MPI didn’t exhibit the same problem at size 1600 as the POSIX and no-MP cases.

POSIX & NO MP 1600 x 1600 case
- Straight C runs long enough to see top output (here I can see the memory usage)
  - Threaded, MPI, and non-MP code share the same basic structure for calculating the “C” matrix
- Suspect some kind of boundary issue here, possibly “false sharing”?
- Process fits entirely in shared L3 cache: 15 MB x 2 = 30 MB
- Do the same number of calculations but make the initial array allocations larger (shown below)

~/SUNY]$ foreach NUM_TRIALS ( )
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time secs
~/SUNY]$ foreach NUM_TRIALS ( )
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time secs
~/SUNY]$

Notes / Future Directions
- Start MPI timer after communication. Is communication the sole source of difference? <- TESTED: NO
- At the boundary conditions the driving force is the amount of memory allocated on the heap
  - Not the number of calculations being performed
- Intel had a nice article about false sharing:
  - link to a product they sell for detecting false sharing on their processors
- Combo MPI and POSIX threads?
  - MPI to multiple machines, then POSIX threads?
  - Found this paper on OpenMP vs direct POSIX programming (similar tests)
  - Couldn’t get MPE running with MPICH (would like to re-investigate why)
- Investigate optimization techniques
  - Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
  - Rerun with non-identity B matrix and compare times <- DONE
- Try different languages, e.g. Chapel
- Try different algorithms
- For < 6 processes look at thread_affinity and assignment of threads to a physical processor
  - There is no guarantee that with 6 or fewer processes they will all reside on the same physical processor
  - Noticed CPU switching occasionally.
  - Setting the affinity can mitigate this: the thread is assigned and not “allowed” to move

Notes / Future Directions cont.
- Notice the shape of the curves for both MPI and POSIX solutions.
  - There is definitely a point of diminishing returns. 6? In this particular case.
- Instead of using 12 cores could we cut the problem set in half and launch 2 independent 6-process solutions by declaring thread_affinity?
  - Would this produce better results?
  - How to merge the 2 process spaces?