A COMPARISON MPI vs POSIX Threads

Overview
- MPI allows you to run multiple processes on 1 host
  - How would running MPI on 1 host compare with a POSIX threads solution?
- Attempting to compare MPI vs POSIX run times
- Hardware
  - Dual 6 Core (2 threads per core), 12 logical
    http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt
  - Intel Xeon CPU E5-2667 (show schematic)
    http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf
  - 2.96 GHz
  - 15 MB L3 Cache
- All code / output / analysis available here:
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/

Specifics
- Going to compare run times of code in MPI vs code written using POSIX threads and shared memory
  - Try to make the code as similar as possible so we’re comparing apples with oranges and not apples with monkeys
  - Since we are on 1 machine, the bus carries all the communication traffic, which should make the POSIX and MPI versions similar (i.e., the network doesn’t get involved)
- Only makes sense with 1 machine
- Set up test bed
  - Try each step individually, check results, then automate
- Use the matrix-matrix multiply code we developed over the semester
  - Everyone is familiar with the code and can make observations
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/pthread_matrix_21.c
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_3.c
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_no_mp.c
- Use square matrices
- Vary matrix sizes from 500 -> 10,000 elements per side (plus a couple of big ones)
- Matrix A will be filled with 1-n, left to right and top down
- Matrix B will be the identity matrix
  - Can then check our results easily, as A*B = A when B = identity matrix
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt
  - Ran all processes (i.e., compile / output result / parsing) many times and checked before writing final scripts to do the processing
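The A*B = A check described above can be sketched in a few lines of C. This is a minimal illustration, not the code from pthread_matrix_21.c, matmat_3.c, or matmat_no_mp.c: the function names are hypothetical, and the element type (double) and row-major storage are assumptions, since the slides do not state them.

```c
#include <assert.h>
#include <stddef.h>

/* Fill an n x n row-major matrix with 1, 2, ..., n*n (left to right, top down). */
void fill_sequential(double *a, size_t n) {
    assert(a != NULL);
    for (size_t i = 0; i < n * n; i++)
        a[i] = (double)(i + 1);
}

/* Fill an n x n row-major matrix with the identity. */
void fill_identity(double *b, size_t n) {
    assert(b != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            b[i * n + j] = (i == j) ? 1.0 : 0.0;
}

/* Plain triple-loop multiply: c = a * b, all n x n, row-major. */
void matmul(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Return 1 if the two n x n matrices match element for element, else 0.
   Exact comparison is safe here only because multiplying by 1.0 and
   adding 0.0 introduces no rounding error. */
int matrices_equal(const double *x, const double *y, size_t n) {
    for (size_t i = 0; i < n * n; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

Calling fill_sequential on A, fill_identity on B, matmul into C, and then matrices_equal(A, C, n) gives a quick pass/fail check for any matrix size.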

Matrix Sizes

MATRIX SIZE   NUM ELEMENTS   LOOP CALCULATIONS (N multiplies, N-1 adds)
        500         250000      249750000
        600         360000      431640000
        700         490000      685510000
        800         640000     1023360000
        900         810000     1457190000
       1000        1000000     1999000000
       1100        1210000     2660790000
       1200        1440000     3454560000
       1300        1690000     4392310000
       1400        1960000     5486040000
       1500        2250000     6747750000
       1600        2560000     8189440000
       1700        2890000     9823110000
       1800        3240000    11660760000
       1900        3610000    13714390000
       2000        4000000    15996000000
       2100        4410000    18517590000
       2200        4840000    21291160000
       2300        5290000    24328710000
       2400        5760000    27642240000
       2500        6250000    31243750000
       2600        6760000    35145240000
       2700        7290000    39358710000
       2800        7840000    43896160000
       2900        8410000    48769590000
       3000        9000000    53991000000
       4000       16000000    1.27984E+11
       5000       25000000    2.49975E+11
       6000       36000000    4.31964E+11
       7000       49000000    6.85951E+11
       8000       64000000    1.02394E+12
       9000       81000000    1.45792E+12
      10000      100000000    1.9999E+12

Third column: just the number of calculations inside the loop for calculating the matrix elements
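The third column follows from the column header: each of the N x N result elements needs N multiplies and N-1 adds, so the total is N^2 * (2N - 1). A small C sketch that reproduces the figures (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Inner-loop operations for an n x n times n x n multiply:
   n*n result elements, each needing n multiplies and n-1 adds.
   64-bit arithmetic is needed: the count passes 2^32 by n = 1300. */
uint64_t loop_calculations(uint64_t n) {
    assert(n > 0);
    return n * n * (2 * n - 1);
}
```

For example, loop_calculations(500) gives 249750000 and loop_calculations(10000) gives 1999900000000, which the table shows as 1.9999E+12.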

Specifics cont.
- About the runs
  - For each matrix size (500 -> 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)
  - Vary thread count 2-12 (POSIX)
  - Vary processes 2-12 (MPI)
  - Run 10 trials of each and take the average (machine is mostly idle when not running tests, but want to smooth spikes in run times caused by the system doing routine tasks)
- Make observations about anomalies in the run times where appropriate
- Caveats
  - All initial runs with no optimization for testing, but hey, this is a class about performance
  - Second set of runs with optimization turned on, -O1 (note: -O2 and -O3 made no appreciable difference)
  - First-level optimization made a huge difference: > 3x improvement
  - GNU optimization explanation can be found here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
  - Built with just the individual flags behind -O1 to see if I could catch the “one” making the most difference (nope; the code isn’t that complicated)
  - Not all optimizations are flag controlled
  - Regardless of whether the code is written in the most efficient fashion (and it’s not), because of the similarity we can make some runs and observations
- Oh No moment **
  - Huge improvement in performance with optimized code, why?
  - What if the improvement in performance (from compiler optimization) was due to the identity matrix?
  - Maybe the compiler found a clever way to increase the speed because of the simple math and it’s not really doing all the calculations I thought it was?
  - Came back and made matrix B non-identity: same performance. Whew.
  - I now believe the main performance improvement came from loop unrolling.
  - Ready to make the runs
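The "run 10 trials and take the average" step might look like the following in C. This is a hedged sketch rather than the scripts used for the actual runs: the function names are hypothetical, and timing with clock_gettime(CLOCK_MONOTONIC) is an assumption about how wall-clock time could be measured.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Arithmetic mean of the recorded run times, in seconds. */
double mean_runtime(const double *secs, size_t trials) {
    assert(secs != NULL && trials > 0);
    double total = 0.0;
    for (size_t i = 0; i < trials; i++)
        total += secs[i];
    return total / (double)trials;
}

/* Current wall-clock reading in seconds from a monotonic clock. */
double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time `work` once per trial and return the mean run time in seconds. */
double average_of_trials(void (*work)(void), size_t trials) {
    assert(work != NULL && trials > 0);
    double total = 0.0;
    for (size_t i = 0; i < trials; i++) {
        double start = wall_seconds();
        work();
        total += wall_seconds() - start;
    }
    return total / (double)trials;
}
```

Averaging smooths occasional spikes but does not remove them; keeping the individual trial times as well makes an outlier trial easy to spot.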

Discussion
- Please chime in as questions come up.
- Process explanation (after initial testing and verification):
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt
- Attempted a 25,000 x 25,000 matrix
  - Compiler error for MPI (exceeded MPI_Bcast 2 GB limit on matrices)
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt
  - Not an issue for POSIX threads (until you run out of memory on the machine and start to swap)
- Settled on 12 processes / threads because of the number of cores available
  - Do you get enhanced or degraded performance by exceeding that number?
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt
- Example of process space / top output (10,000 x 10,000)
  - Early testing, before runs started; pre-optimization
  - http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
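The 2 GB figure can be checked with simple arithmetic. The sketch below only computes the size of one matrix and compares it with the largest signed 32-bit value; it does not reproduce the actual error in BadCompileMPI.txt. The function names are hypothetical, and the element size is a parameter because the slides do not state the element type.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for one n x n matrix with elements of elem_size bytes. */
uint64_t matrix_bytes(uint64_t n, uint64_t elem_size) {
    assert(n > 0 && elem_size > 0);
    return n * n * elem_size;
}

/* Return 1 if the byte count does not fit in a signed 32-bit integer
   (2^31 - 1 bytes, i.e. just under 2 GB), else 0. */
int exceeds_2gb(uint64_t bytes) {
    return bytes > (uint64_t)INT32_MAX;
}
```

Under these assumptions a 25,000 x 25,000 matrix is 2.5 GB with 4-byte elements and 5 GB with 8-byte elements, so it passes the limit either way, while a 10,000 x 10,000 matrix of 8-byte elements (0.8 GB) stays under it.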

Time Comparison (Boring)

Time Comparison (still boring…)
- In all these cases, the times for 5, 4, 3, and 2 processes were much longer than for 6, so they were left off for comparison
- MPI doesn’t “catch” back up until 11 processes
- POSIX doesn’t “catch” back up until 9 processes

MPI Time Curve

POSIX Time Curve

POSIX Threads Vs MPI Processes Run Times Matrix Sizes 4000x4000 – 10,000 x 10,000

POSIX Threads 1500 x 1500 – 2500x2500

1600 x 1600 case
- Straight C runs long enough to see top output (here I can see the memory usage)
  - Threaded, MPI, and non-MP code share the same basic structure for calculating the “C” matrix
- Suspect some kind of boundary issue here, possibly “false sharing”?
- Process fits entirely in shared L3 cache: 15 MB x 2 = 30 MB
- Do the same number of calculations but make the initial array allocations larger (shown below)

    [rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
    foreach? ./a.out
    foreach? end
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs
    [rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS ( 1 2 3 4 5 )
    foreach? ./a.out
    foreach? end
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs
    [rahnbj@rage ~/SUNY]$
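The "same calculations, larger allocation" experiment can be sketched as a multiply that takes a separate leading dimension, together with a helper for probing the "boundary issue" suspicion. Neither function comes from the original code. The helper counts how many cache sets a walk down one column would touch for a given row stride; the geometry used in the note below (64-byte lines, 64 sets, as in a 32 KiB 8-way L1 data cache) and the element size are assumptions, and the result does not by itself establish the cause of the timing gap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* c = a * b for the n x n corner of matrices stored row-major with
   leading dimension ld >= n (ld = n + 1 gives the padded layout). */
void matmul_ld(const double *a, const double *b, double *c,
               size_t n, size_t ld) {
    assert(ld >= n);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * ld + k] * b[k * ld + j];  /* b walked by column */
            c[i * ld + j] = sum;
        }
    }
}

/* Count the distinct cache sets hit when reading one element per row
   down a column, with rows row_stride_bytes apart.
   Returns 0 if the scratch allocation fails. */
size_t distinct_sets(size_t row_stride_bytes, size_t rows,
                     size_t line_bytes, size_t num_sets) {
    assert(line_bytes > 0 && num_sets > 0);
    unsigned char *seen = calloc(num_sets, 1);
    if (seen == NULL)
        return 0;
    size_t count = 0;
    for (size_t r = 0; r < rows; r++) {
        size_t set = (r * row_stride_bytes / line_bytes) % num_sets;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    free(seen);
    return count;
}
```

With 8-byte elements and the assumed geometry, distinct_sets(1600 * 8, 1600, 64, 64) reports 8 sets while distinct_sets(1601 * 8, 1600, 64, 64) reports all 64, so a 1600-wide row stride concentrates a column walk on a small part of the cache. That would be consistent with the padded run being faster, but confirming it would need hardware cache-miss counters.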

Future Directions
- POSIX threads with network memory? (NFS)
- Combo MPI and POSIX threads?
  - MPI to multiple machines, then POSIX threads?
  - http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads
  - POSIX threads that launch MPI?
- Couldn’t get MPE running with MPICH (would like to re-investigate why)
- Investigate optimization techniques
  - Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
  - Rerun with non-identity B matrix and compare times <- DONE
- Try different languages, e.g. Chapel
- Try different algorithms
- Want to add OpenMP to the mix
  - Found this paper on OpenMP vs direct POSIX programming (similar tests)
  - http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf
- For < 6 processes, look at thread_affinity and assignment of threads to a physical processor
