1
Matrix Multiplication in Hadoop
Siddharth Saraph
2
Matrix Multiplication
Matrices are very practical: they appear throughout the sciences, engineering, statistics, etc. Multiplication is a fundamental, nontrivial matrix operation. It is simpler than something like matrix inversion (although the asymptotic time complexity is the same).
3
Matrix Multiplication
Problem: some people want to use enormous matrices that cannot be handled on one machine. Take advantage of map-reduce parallelism to approach this problem.
Heuristics:
10,000x10,000: 100,000,000 entries
100,000x100,000: 10,000,000,000 entries
In practice: sparse matrices.
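To put the entry counts above in terms of memory, here is a small back-of-the-envelope sketch. It assumes dense storage with one 8-byte double per entry; that storage assumption and the name `dense_size` are illustrative and not from the slides.

```python
BYTES_PER_DOUBLE = 8  # assumption: every entry stored as a 64-bit double


def dense_size(n):
    """Return (entry count, bytes) for a dense n x n matrix of doubles."""
    entries = n * n
    return entries, entries * BYTES_PER_DOUBLE


for n in (10_000, 100_000):
    entries, size_bytes = dense_size(n)
    print(f"{n:,} x {n:,}: {entries:,} entries, {size_bytes / 1e9:.1f} GB")
# prints:
# 10,000 x 10,000: 100,000,000 entries, 0.8 GB
# 100,000 x 100,000: 10,000,000,000 entries, 80.0 GB
```

Under that assumption a single dense 100,000x100,000 matrix already exceeds the RAM of a typical single machine, which is why sparsity matters in practice.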
4
First Step: Matrix Representation
How to represent a matrix for input to a map-reduce job? Convenient for sparse matrices: “coordinate list format.” Each entry is a triple (row index, col index, value). (Board.) Omit entries with value 0. Entries can be in an arbitrary order.
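A minimal sketch of the coordinate list idea, using plain Python tuples. The function names and the 0-based indexing are assumptions made for illustration; the slides do not say how the project's input files were laid out.

```python
def to_coordinate_list(dense):
    """Convert a dense list-of-rows matrix to (row, col, value) triples,
    omitting entries whose value is 0."""
    return [
        (i, j, value)
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    ]


def to_dense(entries, rows, cols):
    """Rebuild the dense matrix; the order of the triples does not matter."""
    dense = [[0] * cols for _ in range(rows)]
    for i, j, value in entries:
        dense[i][j] = value
    return dense


matrix = [
    [0, 2.5, 0],
    [0, 0, 0],
    [-1, 0, 4],
]
entries = to_coordinate_list(matrix)
print(entries)  # [(0, 1, 2.5), (2, 0, -1), (2, 2, 4)]
```

Because each triple carries its own indices, the entries can be shuffled or split across input files without losing information, which suits a map-reduce job.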
5
Second Step: Map Reduce Algorithm
Simple “entrywise” method. Various related block methods: matrices are partitioned into smaller blocks, and logically processed as blocks. An excess of notation and indices to keep track of, easy to get lost.
6
Second Step: Map Reduce Algorithm
Chalkboard.
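The chalkboard derivation is not captured in these slides. As a stand-in, here is a sketch of one common one-pass entrywise formulation, simulated locally in plain Python rather than on Hadoop. It may differ from what was presented; the function names and the "A"/"B" tags are assumptions. A is m x n, B is n x p, and both are given as coordinate lists.

```python
from collections import defaultdict


def map_entry(tag, row, col, value, m, p):
    """Map step: send one nonzero entry to every output cell it contributes to."""
    if tag == "A":
        # A[i][k] is needed by C[i][j] for every column j of C.
        for j in range(p):
            yield (row, j), ("A", col, value)
    else:
        # B[k][j] is needed by C[i][j] for every row i of C.
        for i in range(m):
            yield (i, col), ("B", row, value)


def reduce_cell(values):
    """Reduce step: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    a_by_k, b_by_k = {}, {}
    for tag, k, value in values:
        (a_by_k if tag == "A" else b_by_k)[k] = value
    return sum(a_by_k[k] * b_by_k[k] for k in a_by_k if k in b_by_k)


def multiply(a_entries, b_entries, m, p):
    """Run map, shuffle (grouping by key), and reduce in one process."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for tag, entries in (("A", a_entries), ("B", b_entries)):
        for row, col, value in entries:
            for key, tagged in map_entry(tag, row, col, value, m, p):
                groups[key].append(tagged)
    result = []
    for i, j in sorted(groups):
        total = reduce_cell(groups[(i, j)])
        if total != 0:  # keep the output sparse as well
            result.append((i, j, total))
    return result


a = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]  # [[1, 2], [0, 3]]
b = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]  # [[4, 0], [5, 6]]
print(multiply(a, b, m=2, p=2))
# [(0, 0, 14), (0, 1, 12), (1, 0, 15), (1, 1, 18)]
```

In this formulation every entry of A is replicated p times and every entry of B m times before the shuffle, so the intermediate data grows quickly with the matrix dimensions.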
7
Implementation
Java, not Hadoop streaming. Why?
Seemed like a more complex project that would require more control.
Custom Key and Value classes.
Custom Partitioner class for the block method, for distributing keys to reducers.
Learn Java.
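The slides do not show how the custom Partitioner assigned keys. The following is a purely hypothetical sketch in Python of what such a rule could look like: each entry is keyed by the block it falls in, and every key for a given block maps to the same reducer. The names `block_key` and `partition`, the row-major numbering, and the modulo rule are all assumptions. In Hadoop's Java API this role is played by `Partitioner.getPartition(key, value, numPartitions)`.

```python
def block_key(i, j, block_size):
    """Hypothetical key: the (block row, block col) that entry (i, j) falls in."""
    return (i // block_size, j // block_size)


def partition(key, num_block_cols, num_reducers):
    """Hypothetical rule: number blocks row by row, then wrap around the reducers."""
    block_row, block_col = key
    return (block_row * num_block_cols + block_col) % num_reducers


# A 12 x 12 matrix cut into 4 x 4 blocks has 3 block columns.
key = block_key(5, 7, block_size=4)
print(key)                                             # (1, 1)
print(partition(key, num_block_cols=3, num_reducers=4))  # 0
```

The point of a custom rule like this is control: all entries of one block reach the same reducer, and blocks are spread across reducers in a predictable way.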
8
Performance
Random matrix generator: row dimension, column dimension, density. Doubles in (-1000, 1000).
Many parameters to vary: matrix dimensions, double max, number of splits, number of reducers, density of matrix.
Sparse 1000x1000, density 0.1, 6 splits, 12 reducers, 2.9MB: 5 minutes
Sparse 5000x5000, density 0.1, 20 splits, 20 reducers, 73MB: > 1 hour
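A generator with the parameters listed above (row dimension, column dimension, density, maximum magnitude) might look like the following sketch. It is not the project's generator; the function name, the per-cell coin flip, and the `seed` parameter are assumptions.

```python
import random


def random_sparse_matrix(rows, cols, density, max_abs=1000.0, seed=None):
    """Return (row, col, value) triples; each cell is nonzero with
    probability `density` and holds a double between -max_abs and max_abs."""
    rng = random.Random(seed)  # a seed makes the run reproducible
    entries = []
    for i in range(rows):
        for j in range(cols):
            if rng.random() < density:
                value = rng.uniform(-max_abs, max_abs)
                if value != 0.0:  # zeros are omitted in coordinate list format
                    entries.append((i, j, value))
    return entries


entries = random_sparse_matrix(1000, 1000, density=0.1, seed=42)
print(len(entries))  # close to 100,000, since 10% of 1,000,000 cells are filled
```

With density 0.1, the expected number of nonzero entries is one tenth of rows times columns, so the 5000x5000 case holds about 25 times as many entries as the 1000x1000 case.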
9
MATLAB Performance
Windows 7, MATLAB R2015a 64-bit.
Engineering Library cluster, 4 GB RAM: 13,000x13,000 was about the largest that could fit in memory. Full random matrices of doubles. Multiplication time: about 2 minutes.
LaFortune cluster, 16 GB RAM: 20,000x20, density, sparse matrix. Multiplication time: about 2 minutes 30 seconds.
10
Improvements?
Different matrix representation? Maybe there are better ways to represent sparse matrices than coordinate list format.
Strassen’s algorithm? O(n^2.8); benefits of about 10% with matrix dimensions of a few thousand.
Use a different algorithm?
Use a different platform? Spark?
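For reference, Strassen's algorithm replaces the 8 block multiplications of the usual 2x2 block product with 7, which gives the exponent log2(7), roughly 2.807. The sketch below is a plain-Python illustration on dense list-of-lists matrices. It assumes square matrices whose dimension is a power of two and recurses all the way down to 1x1; practical implementations switch to the conventional method below some cutoff size.

```python
def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x, y):
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(x):
    """Cut a 2h x 2h matrix into four h x h quadrants."""
    h = len(x) // 2
    return (
        [row[:h] for row in x[:h]], [row[h:] for row in x[:h]],
        [row[:h] for row in x[h:]], [row[h:] for row in x[h:]],
    )


def strassen(a, b):
    """Multiply two n x n matrices, n a power of two (assumption)."""
    if len(a) == 1:
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of eight.
    m1 = strassen(add(a11, a22), add(b11, b22))
    m2 = strassen(add(a21, a22), b11)
    m3 = strassen(a11, sub(b12, b22))
    m4 = strassen(a22, sub(b21, b11))
    m5 = strassen(add(a11, a12), b22)
    m6 = strassen(sub(a21, a11), add(b11, b12))
    m7 = strassen(sub(a12, a22), add(b21, b22))
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the quadrants back together.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The extra additions and the loss of sparsity in the intermediate sums are part of why the practical gain is modest at dimensions of a few thousand.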
11
Conclusion
What happened to the enormous matrices?
From my project, I do not think Hadoop is a practical choice for implementing matrix multiplication. I did not find any implementations of matrix multiplication in Hadoop that provide significant benefit over local machines.