Slide 1: Cell processor implementation of a MILC lattice QCD application
Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Slide 2: Presentation outline
- Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
- Implementation in Cell/B.E.
  1. PPE performance and stream benchmark
  2. Profile in CPU and kernels to be ported
  3. Different approaches
- Performance
- Conclusion

Slide 3: Introduction
- Our target: the MIMD Lattice Computation (MILC) Collaboration code, dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
- Our view of the MILC applications: a sequence of communication and computation blocks
[Diagram: original CPU-based implementation on the CPU, alternating MPI scatter/gather steps with compute loops 1, 2, ..., n, followed by an MPI scatter/gather for loop n+1]

Slide 4: Introduction to the Cell/B.E. processor
- One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
- 3.2 GHz clock
- 25.6 GB/s processor-to-memory bandwidth
- More than 200 GB/s sustained aggregate bandwidth on the Element Interconnect Bus (EIB)
- Theoretical peak performance: GFLOPS (SP) and GFLOPS (DP)

Slide 5: Presentation outline
- Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
- Implementation in Cell/B.E.
  1. PPE performance and stream benchmark
  2. Profile in CPU and kernels to be ported
  3. Different approaches
- Performance
- Conclusion

Slide 6: Performance on the PPE
- Step 1: run the code on the PPE
- On the PPE it runs roughly 2-3x slower than on a modern CPU
- MILC is bandwidth-bound, which agrees with what we see with the stream benchmark
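As a point of reference (not from the original slides), the stream benchmark referred to above measures sustainable memory bandwidth with simple vector loops such as the triad sketched here; the array size, timing method, and output format are illustrative choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M doubles per array, large enough to defeat caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)      /* STREAM-style triad */
            a[i] = b[i] + 3.0 * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* The triad touches three arrays: two reads and one write per element. */
        printf("triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);

        free(a); free(b); free(c);
        return 0;
    }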

Slide 7: Execution profile and kernels to be ported
- Ten subroutines are responsible for more than 90% of the overall runtime
- All kernels together account for 98.8% of the runtime

Slide 8: Kernel memory access pattern
- Kernel code must be SIMDized, but performance is determined by how fast data is DMAed in and out, not by the SIMDized compute code
- In each iteration only small elements are accessed:
  - lattice site struct: 1832 bytes
  - su3_matrix: 72 bytes
  - wilson_vector: 96 bytes
- Challenge: how to get data into the SPEs as fast as possible?
- The Cell/B.E. has its best DMA performance when data is aligned to 128 bytes and the transfer size is a multiple of 128 bytes; the data layout in MILC meets neither requirement
[Diagram: data accesses for lattice site 0, including data gathered from neighboring sites]

One sample kernel from the udadu_mu_nu() routine:

    #define FORSOMEPARITY(i,s,choice) \
        for( i=((choice)==ODD ? even_sites_on_node : 0 ), \
             s= &(lattice[i]); \
             i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
             i++, s++ )

    FORSOMEPARITY(i,s,parity) {
        mult_adj_mat_wilson_vec( &(s->link[nu]),
                                 ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
        mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
        mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
        mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
        mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
        su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)),
                         ((su3_matrix *)F_PT(s,mat)) );
    }
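To make the DMA constraint concrete, here is a minimal SPE-side sketch (not from the original slides; it assumes the IBM Cell SDK's spu_mfcio.h and a buffer in main memory that is already 128-byte aligned with a size that is a multiple of 128 bytes):

    #include <spu_mfcio.h>

    #define DMA_TAG 0

    /* Local-store buffer, 128-byte aligned for best DMA performance. */
    static unsigned char ls_buf[4096] __attribute__((aligned(128)));

    /* Fetch one block of site data from main memory (effective address ea)
     * into local store.  ea must be 128-byte aligned and size a multiple of
     * 128 bytes to hit the fast DMA path described on the slide. */
    void fetch_block(unsigned long long ea, unsigned int size)
    {
        mfc_get(ls_buf, ea, size, DMA_TAG, 0, 0);
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();   /* block until the transfer completes */
    }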

Slide 9: Approach I, packing and unpacking
- The PPE packs the needed fields from each struct site into contiguous buffers in main memory; the SPEs DMA those buffers in and out, and the PPE unpacks the results
- Good performance in the DMA operations
- But packing and unpacking are expensive on the PPE
[Diagram: struct site data in main memory is packed by the PPE, DMAed to the SPEs, and unpacked on the way back]
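A minimal PPE-side sketch of Approach I (the types are trimmed stand-ins for MILC's structs and the function names are hypothetical): the fields a kernel needs are gathered into one contiguous, 128-byte-aligned buffer that the SPEs can stream with large, well-aligned DMAs, and the results are scattered back afterwards.

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>

    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix;     /* 72 bytes */
    typedef struct { complex_f d[4][3]; } wilson_vector;  /* 96 bytes */

    typedef struct site {            /* heavily trimmed stand-in for struct site */
        su3_matrix    link[4];
        wilson_vector tmp;
        /* ... many more fields in the real MILC struct ... */
    } site;

    /* Pack link[mu] of every site into a contiguous, 128-byte-aligned buffer. */
    su3_matrix *pack_links(const site *lattice, int nsites, int mu)
    {
        su3_matrix *buf;
        if (posix_memalign((void **)&buf, 128, (size_t)nsites * sizeof *buf) != 0)
            return NULL;
        for (int i = 0; i < nsites; i++)
            buf[i] = lattice[i].link[mu];
        return buf;
    }

    /* Unpack results produced by the SPEs back into the lattice. */
    void unpack_tmp(site *lattice, int nsites, const wilson_vector *buf)
    {
        for (int i = 0; i < nsites; i++)
            lattice[i].tmp = buf[i];
    }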

Slide 10: Approach II, indirect memory access
- Replace elements in struct site with pointers that point into contiguous memory regions
- The SPEs can then DMA the contiguous regions directly
- PPE overhead due to the extra indirection
[Diagram: original lattice vs. modified lattice whose sites point into contiguous memory that the SPEs DMA]
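A sketch of Approach II under the same simplified types (names are illustrative): fields that kernels touch move out of struct site into contiguous arrays, and each site keeps only pointers into them, so the SPEs can DMA long contiguous runs at the cost of an extra indirection on the PPE.

    #define _POSIX_C_SOURCE 200112L
    #include <stdlib.h>

    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix;

    typedef struct site_ind {
        su3_matrix *link;            /* points into the contiguous store below */
        /* ... other fields, also converted to pointers ... */
    } site_ind;

    typedef struct lattice_ind {
        site_ind   *sites;           /* one entry per lattice site */
        su3_matrix *link_store;      /* nsites * 4 matrices, contiguous, aligned */
    } lattice_ind;

    int alloc_lattice(lattice_ind *lat, int nsites)
    {
        lat->sites = calloc((size_t)nsites, sizeof *lat->sites);
        if (lat->sites == NULL ||
            posix_memalign((void **)&lat->link_store, 128,
                           (size_t)nsites * 4 * sizeof *lat->link_store) != 0)
            return -1;
        for (int i = 0; i < nsites; i++)
            lat->sites[i].link = &lat->link_store[4 * i];   /* the indirection */
        return 0;
    }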

Slide 11: Approach III, padding and small memory DMAs
- Pad the per-site elements, and struct site itself, to DMA-friendly sizes; good bandwidth is gained at the cost of padding overhead
  - su3_matrix: from 3x3 complex to a 4x4 complex matrix, 72 bytes to 128 bytes, bandwidth efficiency lost: 44%
  - wilson_vector: from 4x3 complex to 4x4 complex, 96 bytes to 128 bytes, bandwidth efficiency lost: 23%
[Diagram: original lattice vs. padded lattice; the SPEs DMA individual padded elements directly from main memory]
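A sketch of the padded types in Approach III (the 4x4 layouts follow the slide; the names and the use of C11 static assertions are my additions): both per-site elements are stored in a 4x4 complex frame so each occupies exactly 128 bytes.

    #include <assert.h>

    typedef struct { float real, imag; } complex_f;

    /* 3x3 SU(3) matrix stored in a 4x4 frame: 16 * 8 = 128 bytes. */
    typedef struct { complex_f e[4][4]; } su3_matrix_pad;

    /* 4-spinor of SU(3) vectors (4x3) stored in a 4x4 frame: 128 bytes. */
    typedef struct { complex_f d[4][4]; } wilson_vector_pad;

    static_assert(sizeof(su3_matrix_pad) == 128,
                  "padded su3_matrix must be exactly 128 bytes");
    static_assert(sizeof(wilson_vector_pad) == 128,
                  "padded wilson_vector must be exactly 128 bytes");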

Slide 12: Struct site padding
- Accesses with a 128-byte stride perform differently for different stride sizes because main memory has 16 banks
- Odd multiples of 128 bytes always reach peak bandwidth
- We therefore pad struct site to 2688 bytes (21 x 128)
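A small sketch of the rationale (illustrative, not MILC code): with 16 memory banks, a stride that is an odd number of 128-byte blocks is coprime with the bank count and rotates through all banks, so the site size is rounded up to the next odd multiple of 128 bytes, which for struct site gives 2688 = 21 * 128.

    #include <stdio.h>

    /* Round a size up to the next odd multiple of 128 bytes. */
    static size_t pad_to_odd_multiple_of_128(size_t n)
    {
        size_t blocks = (n + 127) / 128;   /* round up to whole 128-byte blocks */
        if (blocks % 2 == 0)               /* even block counts share factors   */
            blocks += 1;                   /* with the 16 banks; make it odd    */
        return blocks * 128;
    }

    int main(void)
    {
        /* A site slightly larger than 20 * 128 bytes is padded to 21 * 128. */
        printf("%zu\n", pad_to_odd_multiple_of_128(2600));   /* prints 2688 */
        return 0;
    }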

Slide 13: Presentation outline
- Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
- Implementation in Cell/B.E.
  1. PPE performance and stream benchmark
  2. Profile in CPU and kernels to be ported
  3. Different approaches
- Performance
- Conclusion

Slide 14: Kernel performance
- GFLOPS are low for all kernels
- Bandwidth is around 80% of peak for most kernels
- Kernel speedups over the CPU are between 10x and 20x for most kernels
- The set_memory_to_zero kernel has ~40x speedup; su3mat_copy() speedup is >15x

Slide 15: Application performance
- Single Cell: application speedup of ~8-10x compared to a single Xeon core
- Cell blade: application speedup compared to a two-socket, 8-core Xeon blade (measured for 8x8x16x16 and 16x16x16x16 lattices)
- Profile on Xeon: 98.8% parallel code, 1.2% serial code
- On the Cell, the SPU kernels take 67-38% of the overall runtime and the PPU portion 33-62%, so the PPE is standing in the way of further improvement

Slide 16: Application performance on two blades
- Two Intel Xeon blades: 27.1 seconds in the rest of the code, of which 24.5 seconds are due to MPI
- Two Cell/B.E. blades: 15.9 seconds in the 54 kernels considered for the SPE implementation, 67.9 seconds in the rest of the code (the PPE portion), of which 47.6 seconds are due to MPI; 83.8 seconds total
- For this comparison, the Intel Xeon blades and the Cell/B.E. blades were connected through Gigabit Ethernet
- More data is needed for Cell blades connected through InfiniBand

Slide 17: Application performance, a fair comparison
- Intel Xeon time, Cell/B.E. time, and speedup were measured for the 8x8x16x16 and 16x16x16x16 lattices in four configurations:
  - Single-core Xeon vs. Cell/B.E. PPE
  - Single-core Xeon vs. Cell/B.E. PPE + 1 SPE
  - Quad-core Xeon vs. Cell/B.E. PPE + 8 SPEs
  - Xeon blade vs. Cell/B.E. blade
- The PPE alone is slower than a Xeon core
- The PPE + 1 SPE is ~2x faster than a Xeon core
- A Cell blade is faster than an 8-core Xeon blade

Slide 18: Conclusion
- We achieved reasonably good whole-application performance, measured in GFLOPS, on one Cell processor
- We maintained the MPI framework; without assuming that the code runs on a single Cell processor, certain optimizations (e.g., loop fusion) cannot be done
- The current site-centric data layout forces us to take the padding approach, losing 23-44% of the bandwidth efficiency
  - Fix: a field-centric data layout is desirable
- The PPE slows down the serial part, which is a problem for further improvement
  - Fix: IBM putting a full-featured POWER core in the Cell/B.E.
- The PPE may also cause problems when scaling to multiple Cell blades
  - A test with Cell blades connected over InfiniBand is needed