18.337 / 6.338: Parallel Computing Project Final Report
Parallelization of Matrix Multiply: A Look at How Differing Algorithmic Approaches and CPU Hardware Impact Scaling Calculation Performance in Java
Elliotte Kim, Massachusetts Institute of Technology, Class of 2012
Matrix Multiplication: A (n × m) * B (m × p) = C (n × p)
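Entrywise, C[i][j] = Σₜ A[i][t] · B[t][j], summing t from 1 to m: each entry of C is the dot product of a row of A with a column of B, so computing all of C takes n·m·p scalar multiplications.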
Hypothesis: computing (n × kn) * (kn × n) will take at least k times as long as computing (n × n) * (n × n), regardless of parallelization, provided the same parallelization method is applied to both matmuls. For example, with n = 1024 and k = 2, multiplying (1024 × 2048) by (2048 × 1024) should take at least twice as long as multiplying two 1024 × 1024 matrices.
In both cases, the resulting matrix C will be (n × n).
Ordinary Matrix Multiply
Under ordinary matrix multiplication, the (n × kn) * (kn × n) matmul performs k times as many multiplication operations as the (n × n) * (n × n) matmul: k·n³ versus n³, since each of the n² entries of C requires an inner product of length kn rather than n.
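The slides do not include source code. As a minimal sketch (class, method, and variable names are illustrative, not the author's), an ordinary matmul with the rows of C partitioned evenly across a fixed number of threads might look like this in Java:

```java
import java.util.ArrayList;
import java.util.List;

public class OrdinaryMatMul {

    // Computes C = A * B, where A is (n x m) and B is (m x p),
    // partitioning the rows of C evenly across numThreads threads.
    static double[][] multiply(double[][] a, double[][] b, int numThreads)
            throws InterruptedException {
        int n = a.length;     // rows of A and C
        int m = b.length;     // columns of A = rows of B
        int p = b[0].length;  // columns of B and C
        double[][] c = new double[n][p];

        List<Thread> threads = new ArrayList<>();
        int rowsPerThread = (n + numThreads - 1) / numThreads;
        for (int i = 0; i < numThreads; i++) {
            int lo = i * rowsPerThread;
            int hi = Math.min(lo + rowsPerThread, n);
            Thread worker = new Thread(() -> {
                // Each thread computes its own band of rows of C.
                for (int r = lo; r < hi; r++)
                    for (int j = 0; j < p; j++) {
                        double sum = 0.0;
                        for (int t = 0; t < m; t++)
                            sum += a[r][t] * b[t][j];
                        c[r][j] = sum;
                    }
            });
            threads.add(worker);
            worker.start();
        }
        for (Thread worker : threads) worker.join();  // wait for all bands
        return c;
    }
}
```

For A of size (n × kn) and B of size (kn × n), the inner loop runs kn times for each of the n² entries of C, which is exactly the k·n³ multiplication count noted above.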
Test Case 1: Intel Atom N270, 1.6 GHz; 1 core, 2 threads/core (2 threads total); 56 KB L1 cache; 512 KB L2 cache
[Chart: Ordinary Matrix Multiply, 1 thread, Atom N270; times in ms across values of k; n = 1024]
[Chart: Ordinary Matrix Multiply, 2 threads, Atom N270; times in ms across values of k; n = 1024]
Test Case 2: AMD Turion 64 X2, 2.0 GHz; 2 cores, 1 thread/core (2 threads total); 128 KB L1 cache per core; 512 KB L2 cache per core
[Chart: Ordinary Matrix Multiply, 1 thread, Turion 64 X2; times in ms across values of k; n = 1024]
[Chart: Ordinary Matrix Multiply, 2 threads, Turion 64 X2; times in ms across values of k; n = 1024]
Observation: near doubling in performance going from 1 to 2 threads. The calculation rate slows down going from k = 3 to k = 4. Why? L2 cache access beginning at k = 4.
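One plausible reading, assuming 8-byte doubles: at n = 1024 and k = 4, a row of A and a column of B are 4096 elements (32 KB) each, 64 KB together, so the innermost dot product no longer fits in the L1 data cache (the Turion's 128 KB per-core L1 splits as 64 KB instruction plus 64 KB data) and accesses spill into L2.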
Test Case 3: Intel Core 2 Quad Q6700, 2.66 GHz; 4 cores, 1 thread/core (4 threads total); 128 KB L1 cache per core; 2 × 4 MB L2 cache (each shared between two cores)
[Chart: Ordinary Matrix Multiply, 1 thread, Core 2 Quad Q6700; times in ms across values of k; n = 1024]
[Chart: Ordinary Matrix Multiply, 2 threads, Core 2 Quad Q6700; times in ms across values of k; n = 1024]
[Chart: Ordinary Matrix Multiply, 4 threads, Core 2 Quad Q6700; times in ms across values of k; n = 1024]
Observation: near doubling in performance going from 1 to 2 threads. At 4 threads, increased computation slowdown at k = 4 and k = 7, with recoveries at k = 6 and k = 8. Effects of the shared cache?
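A possible contributor, given the Q6700's layout: its two 4 MB L2 caches are each shared by a pair of cores, so with 4 threads active, two threads contend for each L2, and at certain values of k their working sets may evict one another.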
Ordinary Matrix Multiply: all performance times observed were in accordance with the hypothesis.
The Question: is there an algorithm that can give better than k scaling?
Recursive Matrix Multiply: breaks a matrix into four smaller submatrices, spawns a new thread for each, and applies this recursively until a threshold is reached (see the sketch below).
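Again a sketch only: the slides do not show the implementation, so the details below are assumptions. This version splits the output matrix into four quadrants, uses Java's ForkJoinPool tasks in place of raw threads, and falls back to an ordinary triple loop below an illustrative 64-element cutoff (the report's actual threshold is not stated):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class RecursiveMatMul {

    static final int THRESHOLD = 64;  // illustrative cutoff, not the author's

    // Task that computes the block C[rowLo..rowHi)[colLo..colHi) of C = A * B.
    static class MultiplyTask extends RecursiveAction {
        final double[][] a, b, c;
        final int rowLo, rowHi, colLo, colHi;

        MultiplyTask(double[][] a, double[][] b, double[][] c,
                     int rowLo, int rowHi, int colLo, int colHi) {
            this.a = a; this.b = b; this.c = c;
            this.rowLo = rowLo; this.rowHi = rowHi;
            this.colLo = colLo; this.colHi = colHi;
        }

        @Override
        protected void compute() {
            int rows = rowHi - rowLo, cols = colHi - colLo;
            if (rows <= THRESHOLD && cols <= THRESHOLD) {
                // Base case: ordinary triple loop on this block.
                int m = b.length;
                for (int i = rowLo; i < rowHi; i++)
                    for (int j = colLo; j < colHi; j++) {
                        double sum = 0.0;
                        for (int t = 0; t < m; t++)
                            sum += a[i][t] * b[t][j];
                        c[i][j] = sum;
                    }
            } else {
                // Split the output block into four quadrants,
                // computed as parallel subtasks.
                int rowMid = rowLo + rows / 2, colMid = colLo + cols / 2;
                invokeAll(new MultiplyTask(a, b, c, rowLo, rowMid, colLo, colMid),
                          new MultiplyTask(a, b, c, rowLo, rowMid, colMid, colHi),
                          new MultiplyTask(a, b, c, rowMid, rowHi, colLo, colMid),
                          new MultiplyTask(a, b, c, rowMid, rowHi, colMid, colHi));
            }
        }
    }

    static double[][] multiply(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        new ForkJoinPool().invoke(new MultiplyTask(a, b, c, 0, a.length, 0, b[0].length));
        return c;
    }
}
```

A default ForkJoinPool creates one worker per available core, matching the 2- and 4-core test machines; invokeAll runs the four quadrant tasks in parallel and waits for all of them before returning.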
[Chart: Recursive Matrix Multiply, Atom N270; times in ms across values of k; n = 1024]
Observation: Recursive MatMul is 1 to 3 times faster than parallel Ordinary MatMul on the Atom processor. No drastic slowdown in computation rate after k = 1. Near-linear relationship between calculation times and values of k.
[Chart: Recursive Matrix Multiply, Turion 64 X2; times in ms across values of k; n = 1024]
Observation: Recursive MatMul is 1.5 to 3.5 times faster than parallel Ordinary MatMul on the Turion processor. No drastic slowdown in computation rate between k = 3 and k = 4. Near-linear relationship between calculation times and values of k.
[Chart: Recursive Matrix Multiply, Core 2 Quad Q6700; times in ms across values of k; n = 1024]
Observation: Recursive MatMul is 0.5 to 4 times faster than parallel Ordinary MatMul on the Q6700 processor. Better-than-k scaling at k = 3, 5, 6, 7, and 8. Why?
Conclusions: better-than-k scaling can be achieved, though it is uncertain why. Hardware? Algorithm? A combination of the two? Further research is required.
Conclusions: the algorithmic approach can affect the time required, and so can the hardware. Faster processors help; more cache helps. But the best performance is achieved when the algorithm can account for the hardware and determine the best approach.