ANR Meeting / PetaQCD LAL / Paris-Sud University, May 10-11, 2010.

Key Computation Issues
- Large volume of data (disk / memory / network)
- Significant number of solver iterations due to numerical intractability
- Redundant memory accesses coming from interleaved data dependencies
- Use of double precision, required for accuracy (hardware penalty)
- Misaligned data (inherent to the specific data structures)
  - Exacerbates cache misses (depending on cache size)
  - Becomes a serious problem when considering accelerators
  - Leads to « false sharing » with shared-memory paradigms (POSIX threads, OpenMP)
  - Padding is one solution, but it would dramatically increase the memory requirement (see the alignment sketch below)
- Memory/computation compromise in the data organization (e.g. gauge replication)
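A minimal sketch of the alignment/padding point above, assuming a plain C layout; this is illustrative only, not the actual tmLQCD data structures:

/* Hypothetical illustration -- not the tmLQCD layout. */
#include <complex.h>

/* One Wilson spinor: 4 spin components x 3 colors of double complex
   = 192 bytes, already a multiple of 16, so alignment alone suffices
   for SIMD loads and DMA transfers.                                  */
typedef struct {
    double complex s[4][3];
} spinor __attribute__((aligned(16)));

/* When a per-thread structure is not a multiple of the cache-line size,
   explicit padding avoids false sharing, at the cost of extra memory:  */
typedef struct {
    double acc;
    char   pad[128 - sizeof(double)];   /* pad to a full 128-byte line */
} per_thread_acc __attribute__((aligned(128)));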

Why the CELL Processor?
- Highest computing power in a single « computing node »
- Fast memory access
- Asynchronism between data transfers and computation (see the double-buffering sketch below)

Issues with the CELL Processor?
- Data alignment (both for calculations and transfers)
- Heavy use of list DMA
- Small size of the Local Store (SPU local memory)
- Resource sharing on the Dual Cell Based Blade
- Integration into an existing standard framework
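A minimal double-buffering sketch on the SPU side, using the IBM Cell SDK intrinsics from spu_mfcio.h; the chunk size, buffer names and the compute() placeholder are assumptions, not the library's actual interface:

#include <spu_mfcio.h>

#define CHUNK 4096                        /* bytes per block, multiple of 16    */
volatile char buf[2][CHUNK] __attribute__((aligned(128)));

static void compute(volatile char *block) { (void)block; /* placeholder kernel */ }

void process(uint64_t ea, int nblocks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);              /* prefetch block 0   */
    for (int i = 0; i < nblocks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nblocks)                              /* start next DMA     */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                     /* wait current block */
        mfc_read_tag_status_all();
        compute(buf[cur]);                                /* overlaps next DMA  */
        cur = nxt;
    }
}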

What we have done
- Implementation of each critical kernel on the CELL processor
  - SIMD version of the basic operators
  - Appropriate DMA mechanism (efficient list DMA and double buffering)
  - Merging of consecutive operations into a single operator (latency & memory reuse)
- Aggregation of all these implementations into a single, standalone library
- Effective integration into the tmLQCD package
  - Successful tests (QS20 and QS22)
  - A single SPU thread holds the whole set of routines
  - The SPU thread remains « permanently » active during a working session
  - Data re-alignment
  - Routine-call replacement (invoke the CELL versions in place of the native ones), sketched below
  - This should be the way to commit the work back to tmLQCD (external library and « IsCELL » switch)
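A hedged sketch of the routine-call replacement behind the « IsCELL » switch; all names below are illustrative placeholders, not the real tmLQCD or library symbols:

/* Illustrative dispatch only -- not the actual tmLQCD code. */
typedef struct spinor spinor;              /* opaque here */

extern int IsCELL;                         /* set once at initialization        */

static void apply_dirac_native(spinor *out, spinor *in) { (void)out; (void)in; }
static void apply_dirac_cell  (spinor *out, spinor *in) { (void)out; (void)in; }
/* stubs; the real versions would live in tmLQCD and in the CELL library        */

void apply_dirac(spinor *out, spinor *in)
{
    if (IsCELL)
        apply_dirac_cell(out, in);         /* offload to the persistent SPU thread */
    else
        apply_dirac_native(out, in);       /* original host implementation          */
}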

Global Organization
- Task partitioning, distribution, and synchronization are done by the PPU
- Each SPE operates on its portion of the data with a typical loop of the form (DMA get + SIMD computation + DMA put)
- The SPE, always active, switches to the appropriate operation on each request (a minimal sketch of this service loop follows)
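A minimal sketch of such a persistent SPE service loop (IBM Cell SDK intrinsics); the command codes, the use of argp as the work-buffer address, and the kernel stubs are assumptions for illustration:

#include <spu_mfcio.h>

enum { CMD_STOP = 0, CMD_WILSON_DIRAC = 1, CMD_LINALG = 2 };   /* assumed codes */

volatile char work[16384] __attribute__((aligned(128)));

static void run_wilson_dirac(volatile char *ls) { (void)ls; /* placeholder */ }
static void run_linalg      (volatile char *ls) { (void)ls; /* placeholder */ }

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    (void)speid; (void)envp;
    for (;;) {
        unsigned int cmd = spu_read_in_mbox();        /* block until PPU request */
        if (cmd == CMD_STOP) break;

        mfc_get(work, argp, sizeof(work), 0, 0, 0);   /* DMA get this SPE's part */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();

        if (cmd == CMD_WILSON_DIRAC) run_wilson_dirac(work);
        else                         run_linalg(work);

        mfc_put(work, argp, sizeof(work), 0, 0, 0);   /* DMA put results back    */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();

        spu_write_out_mbox(cmd);                      /* notify completion       */
    }
    return 0;
}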

Optimal list DMA organization for the Wilson-Dirac Operator
- Computing the Wilson-Dirac action for a set of K contiguous spinors requires getting 8K neighbour spinors (example below with a 32x16^3 lattice and even-odd preconditioning)

  S[0] : P[2048] P[63488] P[128] P[1920] P[8]  P[120] P[0] P[7]
  S[1] : P[2049] P[63489] P[129] P[1921] P[9]  P[121] P[1] P[0]
  S[2] : P[2050] P[63490] P[130] P[1922] P[10] P[122] P[2] P[1]
  S[3] : P[2051] P[63491] P[131] P[1923] P[11] P[123] P[3] P[2]

- A direct list DMA to get this « spinor matrix » involves 8x4 DMA items
- A list DMA to get the « transpose » involves 8 + 1 = 9 DMA items (only the last column splits into two contiguous runs)
- In general, our list DMA has 8 + c_K items instead of 8K (bin packing)
- No impact on SPU performance, because access to the LS is uniform
- Significant improvement in global performance and scalability (a sketch with mfc_getl follows)
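A minimal sketch of gathering neighbour spinors with a single list DMA via mfc_getl (IBM Cell SDK); the index layout and the one-run-per-direction assumption are simplifications of the bin-packing scheme described above:

#include <spu_mfcio.h>

#define K          16                      /* spinors handled per request        */
#define SPINOR_SZ  192                     /* 4 x 3 double complex = 192 bytes   */

volatile char neigh[8 * K * SPINOR_SZ] __attribute__((aligned(128)));
volatile mfc_list_element_t dmalist[8] __attribute__((aligned(8)));

/* Gather, for each of the 8 directions, a contiguous run of K neighbour
   spinors with one list element: 8 list items instead of 8*K transfers. */
void gather_neighbours(uint64_t ea_base, const unsigned int first_index[8])
{
    for (int d = 0; d < 8; d++) {
        dmalist[d].notify   = 0;
        dmalist[d].reserved = 0;
        dmalist[d].size     = K * SPINOR_SZ;               /* <= 16 KB per entry */
        dmalist[d].eal      = (uint32_t)(ea_base + (uint64_t)first_index[d] * SPINOR_SZ);
    }
    mfc_getl(neigh, ea_base, dmalist, 8 * sizeof(mfc_list_element_t), 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();
}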

Performance results
We consider a 32x16^3 lattice and the CELL-accelerated version of tmLQCD.
- QS20 and QS22 scaling: time (s), speedup, and GFlops versus the number of SPEs
- Intel i7 quad-core 2.83 GHz baseline: 1 core and 4 cores, with and without SSE
- Solver timings:

                    Intel i7 (SSE, 4 cores)   CELL QS20 (8 SPEs)   CELL QS22 (8 SPEs)
  GCR (57 iters)          11.05 s                   3.78 s               2.04 s
  CG (685 iters)          89.54 s                  42.25 s              22.78 s

Comments
- We observed a factor of 2 between the QS20 and the QS22
- We observed a factor of 4 between the QS22 and the Intel i7 quad-core at 2.83 GHz
- Good scalability on the QS20
- Scalability on the QS22 degrades beyond 4 SPEs (probably a binding issue on the Dual Cell Based Blade, which should be easy to fix)
- Fixing this scalability issue on the QS22 should double the current performance

Ways for improvement
- Implement the « non GAUGE COPY » version (significant memory reduction / packing)
- Explore the SU(3) reconstruction approach at the SPE level (memory and bandwidth savings); a sketch follows
- Have the PPU participate in the calculations (makes sense in double precision)
- Try to scale up to the 16 SPEs of the QS22 Dual Cell Based Blade
- Experiment with a cluster of CELL processors
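A minimal sketch of the SU(3) reconstruction idea: store only the first two rows of each gauge link and rebuild the third row on the SPE as the complex conjugate of the cross product of the first two rows (the standard 12-real-number compression; illustrative code, not the tmLQCD implementation):

#include <complex.h>

/* Rebuild row 2 of an SU(3) matrix from rows 0 and 1:
   u[2][i] = conj( u[0][j]*u[1][k] - u[0][k]*u[1][j] )  for (i,j,k) cyclic. */
void su3_reconstruct_third_row(double complex u[3][3])
{
    u[2][0] = conj(u[0][1] * u[1][2] - u[0][2] * u[1][1]);
    u[2][1] = conj(u[0][2] * u[1][0] - u[0][0] * u[1][2]);
    u[2][2] = conj(u[0][0] * u[1][1] - u[0][1] * u[1][0]);
}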

END
Two accepted conference/workshop publications:
- International Conference on Supercomputing
- International Workshop on Highly Efficient Accelerators and Reconfigurable Technologies