Raspberry Pi Performance Benchmarking

Slides:



Advertisements
Similar presentations
TWO STEP EQUATIONS 1. SOLVE FOR X 2. DO THE ADDITION STEP FIRST
Advertisements

Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.
Symantec 2010 Windows 7 Migration EMEA Results. Methodology Applied Research performed survey 1,360 enterprises worldwide SMBs and enterprises Cross-industry.
Symantec 2010 Windows 7 Migration Global Results.
1
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
McGraw-Hill©The McGraw-Hill Companies, Inc., 2003 Chapter 11 Ethernet Evolution: Fast and Gigabit Ethernet.
Processes and Operating Systems
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 1 Embedded Computing.
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 5 Author: Julia Richards and R. Scott Hawley.
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Author: Julia Richards and R. Scott Hawley
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
1 Copyright © 2010, Elsevier Inc. All rights Reserved Fig 2.1 Chapter 2.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
Objectives: Generate and describe sequences. Vocabulary:
1 Building a Fast, Virtualized Data Plane with Programmable Hardware Bilal Anwer Nick Feamster.
Business Transaction Management Software for Application Coordination 1 Business Processes and Coordination.
2 pt 3 pt 4 pt 5pt 1 pt 2 pt 3 pt 4 pt 5 pt 1 pt 2pt 3 pt 4pt 5 pt 1pt 2pt 3 pt 4 pt 5 pt 1 pt 2 pt 3 pt 4pt 5 pt 1pt Two-step linear equations Variables.
1 RA I Sub-Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Casablanca, Morocco, 20 – 22 December 2005 Status of observing programmes in RA I.
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Custom Statutory Programs Chapter 3. Customary Statutory Programs and Titles 3-2 Objectives Add Local Statutory Programs Create Customer Application For.
DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
Addition Facts
Year 6 mental test 5 second questions
Around the World AdditionSubtraction MultiplicationDivision AdditionSubtraction MultiplicationDivision.
ZMQS ZMQS
PUBLIC KEY CRYPTOSYSTEMS Symmetric Cryptosystems 6/05/2014 | pag. 2.
1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.
Solve Multi-step Equations
Break Time Remaining 10:00.
Figure 12–1 Basic computer block diagram.
Augmenting FPGAs with Embedded Networks-on-Chip
Copyright © 2006 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Technology Education Copyright © 2006 by The McGraw-Hill Companies,
PP Test Review Sections 6-1 to 6-6
ABC Technology Project
Bright Futures Guidelines Priorities and Screening Tables
EIS Bridge Tool and Staging Tables September 1, 2009 Instructor: Way Poteat Slide: 1.
Bellwork Do the following problem on a ½ sheet of paper and turn in.
CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.
2 |SharePoint Saturday New York City
Operating Systems Operating Systems - Winter 2011 Dr. Melanie Rieback Design and Implementation.
Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
Sets Sets © 2005 Richard A. Medeiros next Patterns.
Addition 1’s to 20.
Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M
25 seconds left…...
Subtraction: Adding UP
Equal or Not. Equal or Not
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
Week 1.
TRIANGULAR MATRICES A square matrix in which all the entries above the main diagonal are zero is called lower triangular, and a square matrix in which.
We will resume in: 25 Minutes.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
Essential Cell Biology
Clock will move after 1 minute
PSSA Preparation.
Essential Cell Biology
The DDS Benchmarking Environment James Edmondson Vanderbilt University Nashville, TN.
Immunobiology: The Immune System in Health & Disease Sixth Edition
1 Chapter 13 Nuclear Magnetic Resonance Spectroscopy.
Energy Generation in Mitochondria and Chlorplasts
Murach’s OS/390 and z/OS JCLChapter 16, Slide 1 © 2002, Mike Murach & Associates, Inc.
Presentation transcript:

Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College

Overview Raspberry Pi Cluster Build Performance Performance Benchmark Tools Tuning Analysis Conclusion

Raspberry Pi Cluster – Model B CPU – Broadcom ARM11 76JZF-S 700MHz Can overclock to 1000MHz RAM – 512MB 448MB/64MB – CPU/GPU Linux OS 10/100 BaseT Ethernet port Size of a credit card Price - $35 Image: http://commons.wikimedia.org/wiki/File:RaspberryPi.jpg#mediaviewer/File:RaspberryPi.jpg

Raspberry Pi Cluster - Setup Router – Western Digital N750 NFS mounted external hard drive – 500GB Buffalo Inc. 4 Pi cluster USB hub – Manhattan 7 port USB 2.0 SD card – Kodak 8GB ~$200 Transition to next slide with “After the cluster was set up, we wanted to measure performance”

Performance – Floating Point Operations How we measure performance Busy CPU? Speed? What is a floating point operation (FLOP)? Arithmetic operation Formats Single Double Performance measurement by FLOPS How do we measure FLOPS? General Matrix Multiplication (GEMM) High computational intensity with an increase in matrix size

Performance Benchmarking Single Raspberry Pi BLAS - Basic Linear Algebra Subprograms ATLAS - Automatically Tuned Linear Algebra Software Auto tunes BLAS for any system Raspberry Pi Cluster MPI - Message Passing Interface Standard API for inter-process communication Facilitates parallel programming MPICH 2-1.4.1p1 HPL - High Performance LINPACK Tuned MPI Combined with ATLAS Wrote Custom code ATLAS Added parallel capability Compared with HPL

MyGEMM – Naïve Implementation = * C (i,j) A(i,k) B (k,j) Where N = Matrix Size For i = 1 to N For j = 1 to N For k = 1 to N C(i,j)=C(i,j)+A(i,k)*B(k,j) striding bad! ----- Meeting Notes (7/30/14 13:38) ----- Transistion to benchmarking slide

MyGEMM – Naïve Pitfalls GEMM – Naïve method inefficient Two whole matrices are loaded in to memory Cache is not used efficiently Strides through the matrix If not Naïve, then what? Naïve

MyGEMM – Software Tuning What is block matrix multiplication Matrix is split into smaller matrices/blocks Shrinks matrix size to allow both A & B into fast memory A11 A12 A13 A14 A21 A22 A23 A24 Reduces striding by fully utilizing memory hierarchy Transistion After we tuned for a single node we scaled it to our cluster a11 a12 A11 A31 A32 A33 A34 a21 a22 A41 A42 A43 A44

MyGEMM – Cluster Software Tuning How do we distribute the matrix multiplication? MPI used to distribute blocks to nodes MyGEMM allows for experimentation on block size A11 A12 A13 A14 Node 1 A21 A22 A23 A24 Node 2 A31 A32 A33 A34 Start with further division of matrix Why to divide matrix? How do I reduce matrix size? Divided by number of nodes in the cluster allows matrix multiplication on higher sizes What is cache blocking? reduce the matrix to the exact size of my cache Why blocks? Row/Column wise multiplication Loop unrolling ----- Meeting Notes (7/28/14 13:54) ----- reduces matrix to fit the cache block size is reduced to fit in cache ----- Meeting Notes (7/30/14 08:29) ----- remove bullet 1 add mygemm parallel talk how to distribute the matrix A41 Node 3 A42 A43 A44 Node 4

Hardware Tuning Raspberry Pi allows CPU overclocking and memory sharing between CPU and GPU Memory Memory is shared between CPU and GPU 512MB total onboard memory Up to 496MB can be used for CPU More memory = larger matrices CPU Clock Up to 1000 MHz from 700MHz

Conclusion Performance is dependent on key factors Top 500 – 1993 Matrix Size Block Size RAM size CPU speed Top 500 – 1993 T#292 Tied with General Motors ----- Meeting Notes (7/30/14 16:07) ----- move after yellowstone

Conclusion – Pi Cluster vs. Yellowstone ARM11, 32 bit, Double Precision Yellowstone Sandy Bridge Xeon, 64 bit, Double Precision Power 15.6 Watts 1.4 MWatts Performance 836 MFLOPS 1.2 PFLOPS Price $250.00 $22,500,000.00 MFLOPS/W 54.8 875.3 MFLOPS/$ 3.4 57.2

Thank You Dr. Richard Loft Raghu Raj Prasanna Kumar Amogh Simha Stephanie Barr

Questions?