Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450
Tareq Malas
Advisors: Prof. David Keyes, Dr. Aron Ahmadia


Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450
Tareq Malas
Advisors: Prof. David Keyes, Dr. Aron Ahmadia
Collaborators: Jed Brown, Dr. John Gunnels
King Abdullah University of Science and Technology
November 2011

Motivation
PowerPC 450: representative of exascale architectures
–Increased parallelism: vectorization and multi-issue pipeline
–Silicon and power savings: in-order execution
Streaming numerical kernels:
–At the heart of many scientific applications
–Often the bottleneck in scientific codes
[Figures: 7-point stencil operator; 27-point stencil operator]

Why is tuning computation on the BG/P PowerPC 450 difficult?
Utilizes features to improve efficiency
–SIMDized fused floating-point units

for (i=0; i<N; i++)
    A[i] = B[i] + B[i+1]

[Figure: the 2-wide SIMD loads over B for this loop cannot both be aligned, since B[i] and B[i+1] are offset by one element]

Why is tuning computation on the BG/P PowerPC 450 difficult?
Utilizes features to improve efficiency
–SIMDized fused floating-point units
–Superscalar processor with in-order execution at the core level

Program order: 1 load A, 2 add B, 3 load C, 4 load D, 5 add D, 6 add E, 7 add F

Issued in program order (6 cycles):
Cycle | Load unit | FP unit
  1   | load A    | add B
  2   | load C    | -
  3   | load D    | -
  4   | -         | add D
  5   | -         | add E
  6   | -         | add F

After reordering (4 cycles):
Cycle | Load unit | FP unit
  1   | load A    | add B
  2   | load C    | add E
  3   | load D    | add F
  4   | -         | add D
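The dual-issue, in-order behavior in the tables above can be reproduced with a toy simulator. This is a hypothetical sketch, not the thesis framework: it assumes one load unit, one FP unit, a one-cycle load-to-use latency, and that registers never loaded in the stream are already resident.

```python
# Hypothetical toy model (not the thesis simulator): a dual-issue,
# in-order core with one load unit and one FP unit. An FP op stalls
# until the register it reads has been loaded (assumed 1-cycle latency;
# registers never loaded in the stream are treated as already resident).
LOAD_LATENCY = 1

def simulate(instructions):
    """Return the cycle count for a stream of ('load', reg)/('add', reg) ops."""
    ready = {}  # register -> first cycle its loaded value is usable
    cycle = 0
    i = 0
    while i < len(instructions):
        cycle += 1
        issued_load = issued_fp = False
        # In-order dual issue: keep taking instructions while their unit
        # is free and their operands are ready; otherwise stall this cycle.
        while i < len(instructions):
            kind, reg = instructions[i]
            if kind == "load" and not issued_load:
                ready[reg] = cycle + LOAD_LATENCY
                issued_load = True
                i += 1
            elif kind == "add" and not issued_fp and ready.get(reg, 0) <= cycle:
                issued_fp = True
                i += 1
            else:
                break
    return cycle
```

Running the program-order stream gives 6 cycles and the reordered stream gives 4, matching the tables.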

Engineering tactics
Divide and conquer: 3-point stencil
–Optimize, then replicate into larger stencils
Design focus: computer architecture
–Fully utilize SIMD capabilities
–Reduce pipeline stalls: unroll-and-jam and instruction interleaving (reordering)
Technique: assembly synthesis in Python
–Accelerates prototyping
–Simplifies the source

3-point stencil SIMDization
Utilizing the SIMD-like features of the fused floating-point units: each 2-wide register holds a primary and a secondary element, and the multiply-add variants (regular SIMD, cross, copy-primary, and more) select which lanes are combined.
[Figure: weight vector W applied to input A to produce result R, e.g. r3 = a2*W0 + a3*W1 + a4*W2]
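As a rough illustration of the lane-selecting variants named above, each 2-wide register can be modeled as a (primary, secondary) pair. These helpers are a hypothetical sketch only; the exact semantics of the PowerPC 450 paired-FPU instructions differ in detail.

```python
# Hypothetical model of the 2-wide fused FPU: each "register" is a
# (primary, secondary) tuple, and each variant chooses which lane of
# `a` feeds each half of the fused multiply-add a*b + c.

def fma_regular(a, b, c):
    """Regular SIMD FMA: lane-wise a*b + c."""
    return (a[0] * b[0] + c[0], a[1] * b[1] + c[1])

def fma_cross(a, b, c):
    """Cross variant: a's primary/secondary lanes are swapped first."""
    return (a[1] * b[0] + c[0], a[0] * b[1] + c[1])

def fma_copy_primary(a, b, c):
    """Copy-primary variant: a's primary lane is broadcast to both halves."""
    return (a[0] * b[0] + c[0], a[0] * b[1] + c[1])
```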

Mutate-mutate vs. load-copy
[Table comparing the two kernels: load/store and FPU operation counts, cycles, input/output register counts, and load/store and FPU utilization %; the numeric entries were not recovered]
Mutate-mutate
–Fully utilizes the FPU
–Requires fewer registers
Load-copy
–Requires fewer load cycles

Unroll-and-jam reduces data hazards

Original loop (2 sources, 1 destination; each FMA stalls on the previous update of A[i]):
for (i=0; i<4; i++)
    for (j=0; j<5; j++)
        A[i] += q*B[i][j] + p*B[i][j+1]

A[0] += q*B[0][0]
stall
A[0] += p*B[0][1]
stall
A[0] += q*B[0][2]
...

After unroll-and-jam (2 sources, 2 destinations; the two accumulators are independent):
for (i=0; i<4; i+=2)
    for (j=0; j<5; j++) {
        A[i]   += q*B[i][j]   + p*B[i][j+1]
        A[i+1] += q*B[i+1][j] + p*B[i+1][j+1]
    }

A[0] += q*B[0][0]
A[1] += q*B[1][0]
A[0] += p*B[0][1]
A[1] += p*B[1][1]
...
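The transformation can be sketched in plain Python (illustrative only; the real kernels are generated assembly, where the two independent accumulator chains are what hide the FMA latency):

```python
def stencil_rolled(B, q, p):
    """Baseline: one accumulator per row; each FMA depends on the previous one."""
    A = [0.0] * len(B)
    for i in range(len(B)):
        for j in range(len(B[i]) - 1):
            A[i] += q * B[i][j] + p * B[i][j + 1]
    return A

def stencil_unroll_and_jam(B, q, p):
    """Unroll i by 2 and jam the copies: two independent accumulator
    chains interleave in the inner loop (assumes an even row count)."""
    A = [0.0] * len(B)
    for i in range(0, len(B), 2):
        for j in range(len(B[i]) - 1):
            A[i]     += q * B[i][j]     + p * B[i][j + 1]
            A[i + 1] += q * B[i + 1][j] + p * B[i + 1][j + 1]
    return A
```

Both versions accumulate each A[i] in the same order, so they produce identical results; only the instruction-level parallelism changes.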

Unroll-and-jam data reuse
[Figure: the four jammed results R(i,j), R(i,j+1), R(i+1,j), R(i+1,j+1) share the stencil weights w1-w9, so each weight register loaded once is reused across all four jammed updates]

Pythonic code synthesis overview
[Diagram: Python code produces instructions as a list of objects; after register allocation, the instruction scheduler and simulator feed two back ends: a PowerPC 450 simulator (modeling GPRs, FPRs, and memory) that emits a simulation log with debugging information, and a C code generator that emits documented C code]
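A miniature of this flow might look as follows. All names here are hypothetical, not the framework's actual API; the mnemonics are only for flavor, and the emitter targets GCC-style extended inline assembly.

```python
# Hypothetical miniature of the synthesis flow: instructions are plain
# Python objects, a trivial allocator maps virtual to physical registers,
# and a generator emits one GCC extended-asm statement per instruction.
class Instr:
    def __init__(self, op, dst, *srcs):
        self.op, self.dst, self.srcs = op, dst, list(srcs)

def allocate_registers(instrs, prefix="f"):
    """Assign each virtual register a numbered physical register,
    in first-use order (no spilling in this toy version)."""
    mapping = {}
    for ins in instrs:
        for v in [ins.dst] + ins.srcs:
            if v not in mapping:
                mapping[v] = f"{prefix}{len(mapping)}"
    return mapping

def emit_c(instrs):
    """Render the instruction list as inline-assembly C statements."""
    regs = allocate_registers(instrs)
    lines = []
    for ins in instrs:
        ops = ", ".join(regs[v] for v in [ins.dst] + ins.srcs)
        lines.append(f'asm volatile("{ins.op} {ops}");')
    return "\n".join(lines)
```

For example, `emit_c([Instr("fpmadd", "r0", "a", "w", "r0")])` renders the virtual names as f0..f2 in a single asm statement.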

Pythonic code synthesis instruction scheduling
Goal:
–Issue one load/store and one FMA instruction each cycle
–Reduce read-after-write (RAW) data-dependency hazards
Technique (greedy), per cycle:
–Build the list of instructions with no RAW hazards
–Execute the instruction(s) requiring the minimal stall
–Repeat until all instructions are scheduled
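The greedy pass above can be sketched as a list scheduler for a single-issue core. This is hypothetical code: the real scheduler models both pipelines and ranks ready candidates by stall cost, whereas this sketch simply takes the first ready instruction.

```python
# Hypothetical greedy list scheduler for a single-issue core.
# instrs: instruction ids in program order; deps: id -> set of producer ids.
def greedy_schedule(instrs, deps, latency=1):
    """Each cycle, issue the first instruction whose RAW dependences
    (producers) have completed; otherwise stall. Returns (cycle, id) pairs."""
    done_at = {}      # id -> first cycle its result is usable
    schedule = []
    remaining = list(instrs)
    cycle = 0
    while remaining:
        cycle += 1
        ready = [i for i in remaining
                 if all(done_at.get(d, float("inf")) <= cycle
                        for d in deps.get(i, ()))]
        if ready:
            pick = ready[0]  # the real pass picks the minimal-stall candidate
            done_at[pick] = cycle + latency
            schedule.append((cycle, pick))
            remaining.remove(pick)
    return schedule
```

With a 2-cycle latency and `b` depending on `a`, the scheduler hoists the independent `c` into the stall slot, issuing a, c, b back to back.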

Unroll-and-jam effects: 27-point stencil
[Figure: performance results]

Kernel and L2 effects: 7-point stencil
[Figure: performance results]

Unroll-and-jam effects: 3-point stencil
[Figure: performance results]

Instruction scheduling optimization formulation
[Slide content not recovered]

Conclusion
SIMDizing the computations of streaming numerical kernels is challenging
Assembly programming is important for “peak” hardware utilization
We introduced a code synthesis and simulation framework that facilitates:
–A faster development-testing loop
–Instruction reordering for improved efficiency
–Cycle-accurate performance modeling