18.337 / 6.338: Parallel Computing Project Final Report. Parallelization of Matrix Multiply: A Look at How Differing Algorithmic Approaches and CPU Hardware Impact Scaling Calculation Performance in Java.

18.337 / 6.338: Parallel Computing Project Final Report. Parallelization of Matrix Multiply: A Look at How Differing Algorithmic Approaches and CPU Hardware Impact Scaling Calculation Performance in Java. Elliotte Kim, Massachusetts Institute of Technology, Class of 2012.

Matrix Multiplication: A (n x m) * B (m x p) = C (n x p)

Hypothesis: Computing (n x kn) * (kn x n) will take at least k times as long as computing (n x n) * (n x n), regardless of parallelization, provided the same parallelization method is applied to both matmuls.

In both cases, the resulting matrix C will be (n x n).
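The operation-count reasoning behind this hypothesis can be checked in a few lines of Java (a sketch; the class and method names are mine, not the project's):

```java
// Counting scalar multiplications for the ordinary algorithm.
public class OpCount {
    // An (n x m) * (m x p) product performs n*m*p scalar multiplications.
    static long mulOps(long n, long m, long p) {
        return n * m * p;
    }

    public static void main(String[] args) {
        long n = 1024, k = 4;
        long square = mulOps(n, n, n);     // (n x n) * (n x n)
        long rect   = mulOps(n, k * n, n); // (n x kn) * (kn x n)
        // The rectangular product costs exactly k times as much work,
        // hence the expectation of an at-least-k-times-longer runtime.
        System.out.println(rect / square); // prints 4
    }
}
```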

Ordinary Matrix Multiply

Under Ordinary Matrix Multiplication, the (n x kn) * (kn x n) matmul performs k times as many multiplication operations as the (n x n) * (n x n) matmul.
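An ordinary matrix multiply in Java might look like the following. This is a sketch of the standard triple loop, not the project's actual code; partitioning the rows of C across threads is my assumption about how the parallel runs below were set up:

```java
public class OrdinaryMatMul {
    // Standard triple-loop multiply: C = A * B, A is n x m, B is m x p.
    // Only rows [rowStart, rowEnd) of C are computed, so the outer loop
    // can be partitioned across threads.
    static void multiplyRows(double[][] a, double[][] b, double[][] c,
                             int rowStart, int rowEnd) {
        int m = b.length;      // inner (shared) dimension
        int p = b[0].length;   // columns of B and C
        for (int i = rowStart; i < rowEnd; i++) {
            for (int j = 0; j < p; j++) {
                double sum = 0.0;
                for (int x = 0; x < m; x++) {
                    sum += a[i][x] * b[x][j];
                }
                c[i][j] = sum;
            }
        }
    }

    // Split the rows of C evenly across the given number of threads.
    // Each thread writes a disjoint row range, so no locking is needed.
    static void parallelMultiply(double[][] a, double[][] b, double[][] c,
                                 int threads) {
        int n = a.length;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int lo = t * n / threads;
            final int hi = (t + 1) * n / threads;
            workers[t] = new Thread(() -> multiplyRows(a, b, c, lo, hi));
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```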

Test Case 1: Intel Atom N GHz, 1 core, 2 threads/core (2 threads total), 56 KB L1 cache, 512 KB L2 cache

[Chart: Ordinary Matrix Multiply, 1 thread, n = 1024; times in ms]

[Chart: Ordinary Matrix Multiply, 2 threads, n = 1024; times in ms]

Test Case 2: AMD Turion 64 X2, 2.0 GHz, 2 cores, 1 thread/core (2 threads total), 128 KB L1 cache per core, 512 KB L2 cache per core

[Chart: Ordinary Matrix Multiply, 1 thread, n = 1024; times in ms]

[Chart: Ordinary Matrix Multiply, 2 threads, n = 1024; times in ms]

Observation: Near doubling in performance going from 1 to 2 threads. The calculation rate slows down going from k = 3 to k = 4. Why? L2 cache access begins at k = 4.

Test Case 3: Intel Core2 Quad Q6700, 4 cores, 1 thread/core (4 threads total), 128 KB L1 cache per core, 2 x 4 MB L2 cache (shared)

[Chart: Ordinary Matrix Multiply, 1 thread, n = 1024; times in ms]

[Chart: Ordinary Matrix Multiply, 2 threads, n = 1024; times in ms]

[Chart: Ordinary Matrix Multiply, 4 threads, n = 1024; times in ms]

Observation: Near doubling in performance going from 1 to 2 threads. At 4 threads, increased computation slowdown at k = 4 and k = 7, with recoveries at k = 6 and k = 8. Effects of the shared cache?

Ordinary Matrix Multiply: all performance times observed were in accordance with the hypothesis.

The Question: Is there an algorithm that can give better than k scaling?

Recursive Matrix Multiply: Breaks each matrix into 4 smaller submatrices, spawns a new thread for each submatrix, and applies this recursively until a threshold size is reached.
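The recursive scheme just described can be sketched with Java's fork/join framework. This is an assumption-laden sketch, not the project's code: the slides say new threads are spawned per submatrix, while here fork/join tasks are used, and the threshold value is mine. With a quadrant split, each quadrant of C is the sum of two submatrix products, so the eight recursive products run in two rounds of four, letting both rounds accumulate into C safely:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class RecursiveMatMul {
    static final int THRESHOLD = 64; // assumed cutoff to the plain loop

    // Computes C[ci..ci+size)[cj..cj+size) += A-block * B-block, where the
    // blocks start at (ai, aj) and (bi, bj). Offsets avoid copying quadrants.
    static class MulTask extends RecursiveAction {
        final double[][] a, b, c;
        final int ai, aj, bi, bj, ci, cj, size;

        MulTask(double[][] a, int ai, int aj, double[][] b, int bi, int bj,
                double[][] c, int ci, int cj, int size) {
            this.a = a; this.ai = ai; this.aj = aj;
            this.b = b; this.bi = bi; this.bj = bj;
            this.c = c; this.ci = ci; this.cj = cj;
            this.size = size;
        }

        @Override
        protected void compute() {
            if (size <= THRESHOLD) { // base case: accumulating triple loop
                for (int i = 0; i < size; i++)
                    for (int k = 0; k < size; k++) {
                        double aik = a[ai + i][aj + k];
                        for (int j = 0; j < size; j++)
                            c[ci + i][cj + j] += aik * b[bi + k][bj + j];
                    }
                return;
            }
            int h = size / 2;
            // Round 1: one product per quadrant of C, in parallel.
            invokeAll(
                new MulTask(a, ai, aj, b, bi, bj, c, ci, cj, h),                    // C11 += A11*B11
                new MulTask(a, ai, aj, b, bi, bj + h, c, ci, cj + h, h),            // C12 += A11*B12
                new MulTask(a, ai + h, aj, b, bi, bj, c, ci + h, cj, h),            // C21 += A21*B11
                new MulTask(a, ai + h, aj, b, bi, bj + h, c, ci + h, cj + h, h));   // C22 += A21*B12
            // Round 2: the remaining product for each quadrant.
            invokeAll(
                new MulTask(a, ai, aj + h, b, bi + h, bj, c, ci, cj, h),            // C11 += A12*B21
                new MulTask(a, ai, aj + h, b, bi + h, bj + h, c, ci, cj + h, h),    // C12 += A12*B22
                new MulTask(a, ai + h, aj + h, b, bi + h, bj, c, ci + h, cj, h),    // C21 += A22*B21
                new MulTask(a, ai + h, aj + h, b, bi + h, bj + h, c, ci + h, cj + h, h)); // C22 += A22*B22
        }
    }

    // Assumes square, power-of-two matrices and a zero-initialized C.
    static void multiply(double[][] a, double[][] b, double[][] c) {
        ForkJoinPool.commonPool().invoke(
            new MulTask(a, 0, 0, b, 0, 0, c, 0, 0, a.length));
    }
}
```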

[Chart: Recursive Matrix Multiply, Atom processor, n = 1024; times in ms]

Observation: Recursive matmul is 1 to 3 times FASTER than parallel ordinary matmul on the Atom processor. No drastic slowdown in computation rate after k = 1. Near-linear relationship between calculation times and values of k.

[Chart: Recursive Matrix Multiply, Turion processor, n = 1024; times in ms]

Observation: Recursive matmul is 1.5 to 3.5 times FASTER than parallel ordinary matmul on the Turion processor. No drastic slowdown in computation rate between k = 3 and k = 4. Near-linear relationship between calculation times and values of k.

[Chart: Recursive Matrix Multiply, Q6700 processor, n = 1024; times in ms]

Observation: Recursive matmul is 0.5 to 4 times FASTER than parallel ordinary matmul on the Q6700 processor. Better-than-k scaling performance when k = 3, 5, 6, 7, and 8. Why?

Conclusions: Better-than-k scaling can be achieved, though it is uncertain why. Hardware? Algorithm? A combination of the two? Further research is required.

Conclusions: The algorithmic approach can affect the time required. Hardware can affect the time required. Faster processors help. More cache helps. But the best performance is achieved when algorithms account for the hardware and determine the best approach.