Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College
Overview Raspberry Pi Cluster Build Performance Performance Benchmark Tools Tuning Analysis Conclusion
Raspberry Pi Cluster – Model B CPU – Broadcom ARM1176JZF-S 700MHz Can overclock to 1000MHz RAM – 512MB 448MB/64MB – CPU/GPU Linux OS 10/100 BaseT Ethernet port Size of a credit card Price - $35 Image: http://commons.wikimedia.org/wiki/File:RaspberryPi.jpg#mediaviewer/File:RaspberryPi.jpg
Raspberry Pi Cluster - Setup Router – Western Digital N750 NFS mounted external hard drive – 500GB Buffalo Inc. 4 Pi cluster USB hub – Manhattan 7 port USB 2.0 SD card – Kodak 8GB ~$200 Transition to next slide with “After the cluster was set up, we wanted to measure performance”
Performance – Floating Point Operations How we measure performance Busy CPU? Speed? What is a floating point operation (FLOP)? Arithmetic operation Formats Single Double Performance measurement by FLOPS How do we measure FLOPS? General Matrix Multiplication (GEMM) High computational intensity with an increase in matrix size
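A minimal sketch of how FLOPS can be derived from a GEMM run: multiplying two N x N matrices takes N^3 multiplications and N^3 additions, so the operation count is 2N^3, and dividing by the elapsed time gives FLOPS. The function names below are illustrative and do not come from the talk, and the pure-Python kernel stands in for a real BLAS call, so the number it prints says nothing about the Pi's actual performance.

```python
import time


def gemm_flop_count(n):
    """Floating point operations in an n x n GEMM: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3


def mflops(n, seconds):
    """Convert an n x n GEMM run time into millions of FLOPs per second."""
    return gemm_flop_count(n) / seconds / 1e6


def time_gemm(n):
    """Multiply two constant n x n matrices and return (result, elapsed seconds)."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    elapsed = time.perf_counter() - start
    return c, elapsed


if __name__ == "__main__":
    n = 64
    _, elapsed = time_gemm(n)
    print(f"N={n}: {mflops(n, elapsed):.2f} MFLOPS")
```

Because the operation count grows as N^3 while the data grows as N^2, larger matrices do more arithmetic per element loaded, which is why computational intensity rises with matrix size.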
Performance Benchmarking Single Raspberry Pi BLAS - Basic Linear Algebra Subprograms ATLAS - Automatically Tuned Linear Algebra Software Auto-tunes BLAS for any system Raspberry Pi Cluster MPI - Message Passing Interface Standard API for inter-process communication Facilitates parallel programming MPICH2 1.4.1p1 HPL - High Performance LINPACK Tuned MPI Combined with ATLAS Wrote custom code ATLAS Added parallel capability Compared with HPL
MyGEMM – Naïve Implementation
C = A * B, indexed as C(i,j), A(i,k), B(k,j), where N = matrix size
For i = 1 to N
  For j = 1 to N
    For k = 1 to N
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
Striding through the matrices is bad!
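The triple loop above can be written as a short runnable sketch. This is an illustration in Python with 0-based indices and nested lists, not the MyGEMM source; the name naive_gemm is hypothetical.

```python
def naive_gemm(a, b, n):
    """Return C = A * B for n x n matrices using the plain i, j, k loop order."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # b[k][j] moves down a column as k changes, so consecutive
                # accesses are a full row apart in a row-major layout.
                c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(naive_gemm(a, b, 2))
```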
MyGEMM – Naïve Pitfalls GEMM – Naïve method inefficient Two whole matrices are loaded into memory Cache is not used efficiently Strides through the matrix If not naïve, then what?
MyGEMM – Software Tuning
What is block matrix multiplication?
Matrix is split into smaller matrices/blocks
Shrinks matrix size to allow both A & B into fast memory
Reduces striding by fully utilizing memory hierarchy
Matrix A split into blocks:
  A11 A12 A13 A14
  A21 A22 A23 A24
  A31 A32 A33 A34
  A41 A42 A43 A44
Block A11 split further:
  a11 a12
  a21 a22
Transition: After we tuned for a single node, we scaled it to our cluster
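A minimal sketch of block matrix multiplication, assuming square matrices stored as nested lists and a caller-chosen block size. The name blocked_gemm and the loop ordering are illustrative, not taken from MyGEMM. In Python the sketch only shows the loop structure; the cache benefit appears in a compiled language with contiguous arrays.

```python
def blocked_gemm(a, b, n, block):
    """Return C = A * B, visiting the matrices one block x block tile at a time."""
    c = [[0.0] * n for _ in range(n)]
    # The three outer loops pick a tile of C, A, and B.
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # The three inner loops multiply the selected tiles; min()
                # handles edge tiles when block does not divide n evenly.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + block, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    n = 4
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i * j) for j in range(n)] for i in range(n)]
    print(blocked_gemm(a, b, n, 2))
```

The block size is the tuning knob: it should be small enough that the tiles being multiplied stay in fast memory together.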
MyGEMM – Cluster Software Tuning
How do we distribute the matrix multiplication?
MPI used to distribute blocks to nodes
MyGEMM allows for experimentation on block size
  A11 A12 A13 A14   Node 1
  A21 A22 A23 A24   Node 2
  A31 A32 A33 A34   Node 3
  A41 A42 A43 A44   Node 4
Start with further division of the matrix
How do I reduce matrix size? Dividing by the number of nodes in the cluster allows matrix multiplication at larger sizes
What is cache blocking? The block size is reduced so that the blocks fit in cache
Why blocks? Row/column-wise multiplication
Loop unrolling
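A minimal sketch of the row-block distribution, assuming each node receives a contiguous band of rows of A plus all of B and returns its band of C. No MPI calls are made here: the loop over nodes is a stand-in for the scatter, compute, and gather steps, and the function names are hypothetical rather than MyGEMM's.

```python
def split_rows(n, nodes):
    """Return one (start, stop) row range per node, as evenly sized as possible."""
    base, extra = divmod(n, nodes)
    ranges = []
    start = 0
    for rank in range(nodes):
        # The first `extra` nodes take one additional row each.
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def local_multiply(a_rows, b):
    """Work done on one node: multiply its rows of A by the whole of B."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a_rows]


def distributed_gemm(a, b, nodes):
    """Simulate the cluster: hand each node its row band, then gather C in order."""
    c = []
    for start, stop in split_rows(len(a), nodes):
        c.extend(local_multiply(a[start:stop], b))
    return c


if __name__ == "__main__":
    print(split_rows(10, 4))
    print(distributed_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
```

With four nodes, each node holds only a quarter of A and C, which is what lets the cluster handle larger matrices than a single Pi.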
Hardware Tuning
Raspberry Pi allows CPU overclocking and memory sharing between CPU and GPU
Memory
  512MB total onboard memory, shared between CPU and GPU
  Up to 496MB can be used for CPU
  More memory = larger matrices
CPU Clock
  Up to 1000 MHz from 700MHz
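To make "more memory = larger matrices" concrete, here is a rough upper bound on the matrix order N that fits in a given amount of CPU memory. It assumes three N x N double-precision matrices (A, B, and C) at 8 bytes per element, reads "MB" as 2^20 bytes, and ignores memory used by the operating system and the program itself, so the sizes usable in practice are smaller. The function name is illustrative.

```python
import math

MIB = 1024 * 1024


def max_matrix_order(mem_bytes, matrices=3, bytes_per_element=8):
    """Largest N such that `matrices` N x N arrays fit in mem_bytes."""
    elements_per_matrix = mem_bytes // (matrices * bytes_per_element)
    return math.isqrt(elements_per_matrix)


if __name__ == "__main__":
    for cpu_mb in (448, 496):
        n = max_matrix_order(cpu_mb * MIB)
        print(f"{cpu_mb}MB for CPU: N up to about {n}")
```

Under these assumptions, moving from the 448MB split to the 496MB split raises the bound on N by a few hundred.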
Conclusion
Performance is dependent on key factors
  Matrix size
  Block size
  RAM size
  CPU speed
Top 500 – 1993
  Rank #292, tied with General Motors
Conclusion – Pi Cluster vs. Yellowstone
               Pi Cluster                        Yellowstone
Processor      ARM11, 32 bit, Double Precision   Sandy Bridge Xeon, 64 bit, Double Precision
Power          15.6 Watts                        1.4 MWatts
Performance    836 MFLOPS                        1.2 PFLOPS
Price          $250.00                           $22,500,000.00
MFLOPS/W       54.8                              875.3
MFLOPS/$       3.4                               57.2
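The two efficiency figures are ratios: performance divided by power, and performance divided by price. A minimal sketch of that arithmetic, using the rounded power, performance, and price values from this slide (the function name is illustrative). Because those inputs are rounded, the results come out close to the slide's MFLOPS/W and MFLOPS/$ values rather than matching them exactly.

```python
def efficiency(mflops, watts, dollars):
    """Return (MFLOPS per watt, MFLOPS per dollar)."""
    return mflops / watts, mflops / dollars


if __name__ == "__main__":
    # Rounded figures as given on the slide; 1.2 PFLOPS = 1.2e9 MFLOPS,
    # 1.4 MWatts = 1.4e6 watts.
    systems = {
        "Pi Cluster": (836.0, 15.6, 250.0),
        "Yellowstone": (1.2e9, 1.4e6, 22.5e6),
    }
    for name, (mflops, watts, dollars) in systems.items():
        per_watt, per_dollar = efficiency(mflops, watts, dollars)
        print(f"{name}: {per_watt:.1f} MFLOPS/W, {per_dollar:.1f} MFLOPS/$")
```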
Thank You Dr. Richard Loft Raghu Raj Prasanna Kumar Amogh Simha Stephanie Barr
Questions?