ACCELERATING MATRIX LANGUAGES WITH THE CELL BROADBAND ENGINE Raymes Khoury The University of Sydney

2 MATLAB and Octave
• MATLAB
  • High-level, interpreted, untyped language
  • Very popular among scientists and engineers
  • Simple sequential semantics for expressing algorithms with matrix operations
  • Slow for large problem sizes
• Octave
  • Freely available alternative to MATLAB
  • Part of the GNU project
  • Mimics the syntax and semantics of MATLAB
  • Octave's libraries differ from MATLAB's libraries

3 Modern Parallel Architectures
• Traditional single-core processors have reached their performance limits
• Fundamental shift towards parallel architectures
• Currently popular parallel architectures:
  • Cell processor (Sony, Toshiba and IBM)
  • Multi-core CPUs (Intel Core 2 series)
  • General-purpose GPUs (NVIDIA Tesla)
• Significant boost in performance: 15 GFLOPS for a single core vs. 2 TFLOPS

4 The Cell Broadband Architecture
• Parallel microprocessor architecture
• Developed by Sony, Toshiba and IBM between 2000 and 2005
• Used in the IBM Roadrunner, at the time the world's fastest supercomputer (Top500, > 1 petaflop)

5 The Cell Broadband Architecture
[Architecture diagram]

6 Research Questions
• How do we parallelise a matrix language program for modern parallel architectures?

7 Parallelising Matrix Languages
• A) Translate code by hand
  • Concurrent programming is hard
  • Scientists and engineers are typically not trained in concurrent programming
  • Expensive and time-consuming
• B) Automatically parallelise code
  • Our research

8 Parallel MATLAB
• A 2003 survey found 27 parallel MATLAB projects
• Limitations:
  • Targeted toward distributed parallel architectures
  • Varying degrees of programmer intervention required
  • Naive approach: only the data parallelism of matrix operations is exploited

9 PS3: Parallel Octave on the Cell
• Our extension of the Octave interpreter
• Requires minimal changes to existing Octave code
• PS3 exploits several forms of parallelism in Octave programs:
  • Data parallelism: splitting matrices
  • Instruction-level parallelism: independent matrix operations execute in parallel
  • Pipeline parallelism: communication overlaps with computation
  • Task parallelism: concurrent execution of the Octave program and matrix operations

10 Design
[System design diagram]

11 Octave Extension
• Introduced a custom data type called ps3_matrix
• To utilise our system, convert matrices to the ps3_matrix type

Original code:
    x = rand(100);
    y = rand(100);
    a = x + y;
    b = x .* y;
    c = a + b;
    disp(c);

Parallel code:
    x = ps3_matrix(rand(100));
    y = ps3_matrix(rand(100));
    a = x + y;
    b = x .* y;
    c = a + b;
    disp(c);

12 Octave Extension
• Lazy evaluation is used to collect traces of operations whose results are not yet needed
• A data dependence graph of these operations is constructed

Source code:
    x = ps3_matrix(rand(100));
    y = ps3_matrix(rand(100));
    a = x + y;
    b = x .* y;
    c = a + b;
    disp(c);

[Data dependence graph over the values x, y, a, b and c]
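
The mechanism can be illustrated with a small C++ sketch. This is not the actual PS3 implementation; the names DeferredNode, ps3_matrix_sketch, elem_mul and force are invented for illustration. Each overloaded operator records a graph node holding its operands instead of computing a result, so by the time a value is forced (e.g. by disp), the whole trace is visible as a data dependence graph.

    #include <memory>
    #include <string>
    #include <vector>

    // Sketch of lazy evaluation for a matrix wrapper type: every arithmetic
    // operator defers work by recording a node in a data dependence graph.
    struct DeferredNode {
        std::string op;                                     // "rand(100)", "add", "elem_mul", ...
        std::vector<std::shared_ptr<DeferredNode>> inputs;  // data dependencies
    };

    class ps3_matrix_sketch {
    public:
        explicit ps3_matrix_sketch(std::string literal)
            : node_(std::make_shared<DeferredNode>(DeferredNode{std::move(literal), {}})) {}

        friend ps3_matrix_sketch operator+(const ps3_matrix_sketch& a, const ps3_matrix_sketch& b) {
            return defer("add", a, b);
        }
        friend ps3_matrix_sketch elem_mul(const ps3_matrix_sketch& a, const ps3_matrix_sketch& b) {
            return defer("elem_mul", a, b);   // stands in for Octave's .* operator
        }

        // Forcing a value (e.g. disp(c)) hands the accumulated graph to the
        // lowering and scheduling stages; here we simply return the root node.
        std::shared_ptr<DeferredNode> force() const { return node_; }

    private:
        explicit ps3_matrix_sketch(std::shared_ptr<DeferredNode> n) : node_(std::move(n)) {}
        static ps3_matrix_sketch defer(std::string op, const ps3_matrix_sketch& a,
                                       const ps3_matrix_sketch& b) {
            return ps3_matrix_sketch(std::make_shared<DeferredNode>(
                DeferredNode{std::move(op), {a.node_, b.node_}}));
        }
        std::shared_ptr<DeferredNode> node_;
    };

    // Mirrors the slide's example: c = (x + y) + (x .* y); forcing c exposes the DAG.
    int main() {
        ps3_matrix_sketch x("rand(100)"), y("rand(100)");
        auto a = x + y;
        auto b = elem_mul(x, y);
        auto c = a + b;
        auto graph_root = c.force();   // the whole trace is now available as a graph
        return graph_root ? 0 : 1;
    }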

13 Lowering
• Lowering is triggered by evaluating a trace, e.g., disp(c)
• Matrices are split into sub-matrices
• Sub-matrices are computed in parallel on the SPEs
• SPEs have only a 256 KB local store, so splitting matrices is a necessity!

[Figure: the data dependence graph over x, y, a, b, c and the lowered data dependence graph, in which each value is split into four sub-matrices (x1–x4, y1–y4, a1–a4, b1–b4, c1–c4)]
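
A minimal sketch of the splitting step is shown below. It is illustrative only: the constants (how much of the 256 KB local store to reserve, how many tiles must be resident at once) and the restriction to square double-precision tiles are assumptions, and the real lowering pass also rewrites the dependence graph, which is omitted here.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Choose a tile size so that a few double-precision tiles fit in an SPE's
    // 256 KB local store, then enumerate the tiles an n x n operation splits into.
    constexpr std::size_t kLocalStoreBytes = 256 * 1024;
    constexpr std::size_t kUsableBytes     = kLocalStoreBytes / 2;  // leave room for code/buffers (assumed)
    constexpr std::size_t kTilesResident   = 3;                     // two operands + one result (assumed)

    struct Tile { std::size_t row0, col0, rows, cols; };

    std::size_t tile_dim() {
        // Largest square tile of doubles such that kTilesResident tiles fit in kUsableBytes.
        std::size_t dim = 1;
        while (kTilesResident * (dim + 1) * (dim + 1) * sizeof(double) <= kUsableBytes)
            ++dim;
        return dim;
    }

    std::vector<Tile> lower(std::size_t n) {              // an n x n matrix operation
        const std::size_t d = tile_dim();
        std::vector<Tile> tiles;
        for (std::size_t r = 0; r < n; r += d)
            for (std::size_t c = 0; c < n; c += d)
                tiles.push_back({r, c, std::min(d, n - r), std::min(d, n - c)});
        return tiles;
    }

    int main() {
        const auto tiles = lower(1000);
        std::printf("tile dim = %zu, tiles per operation = %zu\n", tile_dim(), tiles.size());
        return 0;
    }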

14 Bubble Scheduler
• Lowered operations are scheduled among the available processors
• We want a schedule that:
  • Satisfies the data dependencies between operations
  • Minimises the makespan of the execution

[Figure: the lowered data dependence graph and a schedule of the sub-matrix operations a1–a4, b1–b4, c1–c4 across processors P1–P3, with the makespan marked; a schedule that violates a dependence is marked ILLEGAL]

15 Scheduler
• The scheduling problem is NP-hard
  • Finding an optimal solution takes too long at runtime
• Designed a heuristic
  • Worst-case runtime complexity O(n log n + m)
  • The earliest instruction is scheduled in the first available stream
• Designed an Integer Linear Programming formulation
  • Gives the optimal solution
  • Used to validate the quality of the heuristic
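
The slides do not spell the heuristic out, so the sketch below only illustrates the general flavour of greedy list scheduling they describe: an operation becomes ready once all of its predecessors have finished, and each ready operation is placed on the stream that becomes free first. The data structures, the naive O(n²) ready-set scan and the tie-breaking are assumptions; the actual heuristic achieves O(n log n + m).

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Greedy list-scheduling sketch (not the PS3 bubble scheduler itself).
    struct Op { std::vector<int> preds; int cost; };

    std::vector<int> schedule(const std::vector<Op>& ops, int streams) {
        std::vector<int> finish(ops.size(), -1);    // finish time of each op (-1 = unscheduled)
        std::vector<int> free_at(streams, 0);       // time at which each stream becomes free
        std::vector<int> placed_on(ops.size(), -1); // stream assigned to each op
        std::size_t done = 0;
        while (done < ops.size()) {
            for (std::size_t i = 0; i < ops.size(); ++i) {
                if (finish[i] != -1) continue;
                int ready_at = 0;
                bool ready = true;
                for (int p : ops[i].preds) {         // all predecessors must have finished
                    if (finish[p] == -1) { ready = false; break; }
                    ready_at = std::max(ready_at, finish[p]);
                }
                if (!ready) continue;
                int best = 0;                        // pick the stream that is free earliest
                for (int s = 1; s < streams; ++s)
                    if (free_at[s] < free_at[best]) best = s;
                int start = std::max(ready_at, free_at[best]);
                finish[i] = start + ops[i].cost;
                free_at[best] = finish[i];
                placed_on[i] = best;
                ++done;
            }
        }
        return placed_on;
    }

    int main() {
        // a and b depend on nothing; c depends on both (mirrors c = a + b per tile).
        std::vector<Op> ops = {{{}, 2}, {{}, 2}, {{0, 1}, 1}};
        std::vector<int> on = schedule(ops, 2);
        for (std::size_t i = 0; i < on.size(); ++i)
            std::printf("op %zu -> stream %d\n", i, on[i]);
        return 0;
    }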

16 Computation Engine
• The computation engine takes a schedule and executes it on the available processors
• We implemented a computation engine for the Cell processor
• Matrix execution units:
  • Each SPE runs a small virtual machine for matrix operations
• Features used for performance:
  • Double/triple buffering
  • SIMD operations
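
Double buffering overlaps the DMA transfer of the next tile with computation on the current one. A sketch of the control flow is below; dma_get_async, dma_wait and compute_tile are hypothetical stand-ins (stubbed so the loop runs anywhere) for the Cell SDK's MFC DMA intrinsics and the SPE matrix kernels, which are not reproduced here.

    #include <cstddef>
    #include <cstdio>

    // Double-buffering sketch for an SPE-style compute loop (illustrative only).
    constexpr std::size_t kTileFloats = 4096;

    void dma_get_async(float* local_buf, std::size_t tile_index, int tag) {
        std::printf("start DMA of tile %zu into buffer %d\n", tile_index, tag);
        for (std::size_t i = 0; i < kTileFloats; ++i) local_buf[i] = float(tile_index);
    }
    void dma_wait(int tag)        { std::printf("wait for DMA tag %d\n", tag); }
    void compute_tile(float* buf) { std::printf("compute on tile value %.0f\n", buf[0]); }

    void process_tiles(std::size_t num_tiles) {
        static float buf[2][kTileFloats];            // two local-store buffers
        int cur = 0;
        dma_get_async(buf[cur], 0, cur);             // prefetch the first tile
        for (std::size_t t = 0; t < num_tiles; ++t) {
            int nxt = cur ^ 1;
            if (t + 1 < num_tiles)
                dma_get_async(buf[nxt], t + 1, nxt); // start fetching the next tile...
            dma_wait(cur);                           // ...then wait only for the current one
            compute_tile(buf[cur]);                  // computation overlaps the in-flight DMA
            cur = nxt;
        }
    }

    int main() { process_tiles(4); return 0; }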

17 Benchmarks
• Nine benchmark kernels were chosen
  • Octave programs that involve many matrix operations
  • They include:
    • Computing Markov chains
    • Computing the discrete Fourier transform of a signal
    • K-means clustering
    • Neural network training
• Compared the runtime of our system on a Cell processor against an Intel Core 2 Quad processor (Q9950, 2.83 GHz)

18 Results: Speedup vs. Intel Core 2 Quad
[Speedup chart]

19 Conclusion
• A new system was presented for the automatic parallelisation of Octave code on the Cell processor
• It exploits several types of parallelism
  • Lazy evaluation exposes instruction-level parallelism
  • Operations are scheduled on processors for maximal utilisation of the parallel units
• Results show significant speedups over Octave running on more recent and more expensive Intel processors

20 Related Work
1. Sutter, H. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 2005, 30.
2. Choy, R. & Edelman, A. Parallel MATLAB: Doing it Right. Proceedings of the IEEE, 2005, 93.
3. Fisher, J. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, 1981.
4. Kwok, Y.-K. & Ahmad, I. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 1999, 31.
5. Chen, T.; Raghavan, R.; Dale, J. N. & Iwata, E. Cell Broadband Engine architecture and its first implementation: a performance view. IBM Journal of Research and Development, 2007, 51.