SPEEDUP – Optimization and Porting of Path Integral MC Code to New Computing Architectures
V. Slavnić, A. Balaž, D. Stojiljković, A. Belić, A. Bogojević
Scientific Computing Laboratory, Institute of Physics Belgrade, Serbia
XXII International Symposium on Nuclear Electronics & Computing, Varna, Bulgaria, Sep 08, 2009

Overview
- Introduction
- SPEEDUP code
- Tested hardware architectures
- Results
  - Setup
  - Serial SPEEDUP code
  - MPI SPEEDUP code
  - Modified SPEEDUP code
  - Cell SPEEDUP code
- Comparison of hardware performance results
- Conclusions

Introduction
- The SPEEDUP code is used for numerical studies of quantum-mechanical systems and of the properties of BECs and ultra-cold atomic gases
- Porting the code enables its use on a broader set of computing resources
- Code optimization allows us to:
  - Fully utilize computing resources
  - Eliminate bottlenecks in the code
  - Use different architectures in a proper way
  - But it must be done carefully (verification!)
- Possibility of benchmarking different hardware platforms
- Results can be used for planning hardware upgrades

SPEEDUP code (1/2)
- Monte Carlo simulations are a natural choice for numerical studies of the relevant physical systems in the functional formalism – Path Integral Monte Carlo
- The SPEEDUP code calculates transition amplitudes using the effective action approach
- It is able to calculate partition functions and expectation values
- It can also be used to extract information about the low-lying energy spectra of quantum systems

SPEEDUP code (2/2): Algorithm
- A good RNG is essential – we use SPRNG
- Initialization
  - Variables
  - Allocate memory
  - Input parameters of the model
  - Number of time and MC steps
  - Random number generator seed
- Main MC loop
  - Generate trajectory
  - Calculate effective action
  - Accumulate results for each discretization level
- Finalizing
  - Gather and average results
  - Print results
  - Deallocate memory
  - Exit
(a minimal C sketch of this structure is given below)
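Below is a minimal, self-contained C sketch of this initialization / main MC loop / finalization structure. All names (generate_trajectory, effective_action), the trivial stand-in "action", and the sample count are illustrative assumptions, not the actual SPEEDUP sources, which use SPRNG and the full effective-action machinery.

```c
/* Minimal sketch of the loop structure above; the helpers are illustrative
 * stand-ins, not the real SPEEDUP implementation (which uses SPRNG). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N_STEPS 64   /* time discretization (illustrative) */

/* Stand-in trajectory generator: random values between the fixed
 * endpoints q(0) = 0 and q(T) = 1. */
static void generate_trajectory(double *q, int n)
{
    q[0] = 0.0;
    q[n] = 1.0;
    for (int j = 1; j < n; j++)
        q[j] = (double)rand() / RAND_MAX;   /* real code: SPRNG */
}

/* Stand-in for the level-p effective action of a trajectory */
static double effective_action(const double *q, int n, int p)
{
    (void)p;
    double s = 0.0;
    for (int j = 0; j < n; j++)
        s += 0.5 * (q[j + 1] - q[j]) * (q[j + 1] - q[j]) * n;
    return s;
}

int main(void)
{
    /* Initialization: parameters, memory, RNG seed */
    const int n_mc = 100000;            /* number of MC samples (illustrative) */
    double q[N_STEPS + 1], sum = 0.0;
    srand(12345);

    /* Main MC loop: generate trajectory, calculate effective action, accumulate */
    for (int i = 0; i < n_mc; i++) {
        generate_trajectory(q, N_STEPS);
        sum += exp(-effective_action(q, N_STEPS, 9));
    }

    /* Finalizing: average and print the result */
    printf("MC average = %g\n", sum / n_mc);
    return 0;
}
```

Compiled with, e.g., gcc -O2 and -lm, this skeleton runs, but the number it prints is meaningless; only the structure mirrors the algorithm above.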

Tested architectures
IBM BladeCenter with three kinds of server systems in the HPC H-type chassis:
- HX21XM blade server
  - Intel Xeon based: 2 quad-core 5405 processors (SSE4.1)
  - ICC and GCC compilers used
- JS22 blade server
  - POWER6 based: 2 dual-core processors supporting multithreading and AltiVec extensions
  - IBM XL C/C++ and GCC compilers used
- QS22 blade server
  - Cell B/E architecture: 2 PowerXCell 8i processors on board, each with
    - 1 PowerPC Processor Element (PPE)
    - 8 Synergistic Processing Elements (SPEs)
  - IBM XL C/C++ Compiler for Multicore Acceleration and GCC compilers used

Performed tests
- Nmc = MC samples
- Boundary conditions for the transition amplitude:
  - q(t=0) = 0
  - q(t=T=1) = 1
- Zero anharmonicity
- Level of effective action p=9 for the quartic anharmonic oscillator
- Same seed for the SPRNG generator used for easy verification of the obtained results

Serial SPEEDUP results
- Significant increase in speed when the platform-specific compiler is used
- POWER6 performance dominates in this benchmark
- Cell is no match when only the PPE is used (without the SPEs)

  Platform   GCC            ICC            XLC
  Intel      (13760±50) s   (10160±30) s   n/a
  POWER6     (17000±10) s   n/a            (1900±10) s
  Cell       (49410±50) s   n/a            (14020±20) s

MPI SPEEDUP results
- Excellent scalability with the number of MPI processes
- Interesting behavior when the number of MPI processes is ≥ 9
- Minimal execution time: 1320 s
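The slide reports only the scaling behavior; as an illustration of how such scaling is typically obtained, the sketch below splits the Nmc samples evenly across MPI ranks and combines the partial sums with MPI_Reduce. It is a sketch of that general scheme, not the actual SPEEDUP MPI implementation; the per-sample work and the seeding are placeholders (the real code uses independent SPRNG streams per process).

```c
/* Sketch of an even split of MC samples over MPI ranks with a final
 * MPI_Reduce; illustrative, not the actual SPEEDUP MPI code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-sample computation (the real code generates a
 * trajectory and evaluates the effective action). */
static double one_mc_sample(void) { return (double)rand() / RAND_MAX; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n_mc = 1000000;               /* total MC samples (illustrative) */
    long local_n = n_mc / size + (rank < n_mc % size ? 1 : 0);

    srand(1234 + rank);                      /* real code: independent SPRNG streams */

    double local_sum = 0.0;
    for (long i = 0; i < local_n; i++)
        local_sum += one_mc_sample();

    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("MC average = %g\n", total / n_mc);

    MPI_Finalize();
    return 0;
}
```

With this kind of decomposition the ranks do not communicate during sampling, which is consistent with the near-ideal scalability reported above.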

Modified SPEEDUP results
- Implemented as a threaded version using POSIX threads (pthreads)
- Each thread calculates Nmc/Num_threads samples (see the sketch below)
- Intel shows the better relative speedup (2.8x, compared to 1.3x for POWER6)
- Best results: POWER6 250 s, Intel 460 s
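A minimal sketch of such a pthreads decomposition, assuming each of NUM_THREADS workers accumulates its Nmc/Num_threads samples in a private accumulator and the main thread sums the partial results after pthread_join. The struct, names, and the per-sample stand-in work are illustrative, not the actual modified SPEEDUP code.

```c
/* Sketch of the threaded variant: one private accumulator per worker,
 * summed by the main thread; illustrative names and workload. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 8
#define N_MC 1000000L

typedef struct {
    unsigned int seed;   /* per-thread seed (real code: separate SPRNG streams) */
    long n_samples;
    double sum;
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = (worker_t *)arg;
    for (long i = 0; i < w->n_samples; i++)
        w->sum += (double)rand_r(&w->seed) / RAND_MAX;  /* stand-in MC sample */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    worker_t  work[NUM_THREADS];

    for (int t = 0; t < NUM_THREADS; t++) {
        work[t] = (worker_t){ .seed = 1234u + t,
                              .n_samples = N_MC / NUM_THREADS,
                              .sum = 0.0 };
        pthread_create(&tid[t], NULL, worker, &work[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += work[t].sum;
    }
    printf("MC average = %g\n", total / N_MC);
    return 0;
}
```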

Cell SPEEDUP results (1/3)
- The heterogeneity of the architecture required a slight rearrangement of the code
- The same code is executed on all SPEs
- Each SPE performs Nmc/Number_of_SPEs MC steps
- There is no SPRNG library for the SPEs!
- Pthreads on the PPE for control of the SPEs and for RNG generation
- DMA transfers of the generated random trajectories from the PPE to the SPEs
- Synchronization with the mailbox technique (see the SPE-side sketch below)
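A rough SPE-side sketch of this scheme, assuming the standard Cell SDK intrinsics from spu_mfcio.h: the PPE is taken to pass the effective address of each buffer of pre-generated random trajectory data through the inbound mailbox, the SPE pulls the buffer into local store with a DMA get, processes it, and signals completion through the outbound mailbox. The buffer size and layout, the termination convention, and all function and variable names are assumptions, not the actual SPEEDUP Cell implementation.

```c
/* Rough SPE-side sketch of the DMA + mailbox scheme described above.
 * Assumed protocol: the PPE sends the 64-bit effective address of the
 * next trajectory buffer as two mailbox words (0 terminates). */
#include <spu_mfcio.h>

#define BUF_BYTES 16384   /* 16 KB: maximum size of a single DMA transfer */
#define DMA_TAG   0

static double buffer[BUF_BYTES / sizeof(double)]
    __attribute__((aligned(128)));           /* local-store DMA buffer */

/* Stand-in for the effective-action accumulation over one buffer */
static double process_buffer(const double *buf, unsigned n)
{
    double s = 0.0;
    for (unsigned i = 0; i < n; i++) s += buf[i];
    return s;
}

int main(unsigned long long spe_id, unsigned long long argp, unsigned long long envp)
{
    (void)spe_id; (void)envp;
    double sum = 0.0;

    for (;;) {
        /* Blocking mailbox reads: high and low halves of the effective address */
        unsigned int ea_hi = spu_read_in_mbox();
        unsigned int ea_lo = spu_read_in_mbox();
        unsigned long long ea = ((unsigned long long)ea_hi << 32) | ea_lo;
        if (ea == 0) break;                  /* assumed termination signal */

        /* DMA the trajectory data from main memory into local store */
        mfc_get(buffer, ea, BUF_BYTES, DMA_TAG, 0, 0);
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();           /* wait for the DMA to complete */

        sum += process_buffer(buffer, BUF_BYTES / sizeof(double));

        /* Tell the PPE this chunk is done */
        spu_write_out_mbox(1);
    }

    (void)sum; (void)argp;  /* real code: return partial results to the PPE */
    return 0;
}
```

The matching PPE side (libspe2 contexts plus a pthread per SPE for control and RNG, as described above) is omitted from this sketch.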

Cell SPEEDUP results (2/3)
- Saturation of the performance around 4 SPEs, caused by the RNG
- Communication does not have a significant impact on the execution time
  - Tested with RNG only, for verification
- Test result: 750 s; ideal time: 250 s

Cell SPEEDUP results (3/3)
- To fully utilize all SPE capabilities, one has to extend the SPE calculation time
  - Increase the effective action level p
  - We demonstrate this by compiling the code without optimization
- Perfect scaling when the PPE has enough time for RNG

Comparison of results
- Results for Intel and POWER6 are obtained using the modified SPEEDUP code
- The Cell ideal time corresponds to full utilization of the SPEs (estimated)

  Intel   POWER6   Cell    Cell ideal
  460 s   250 s    750 s   250 s

Conclusions
- POWER6 and Intel optimization is done using the threaded version of the code
- The Cell platform requires more complex changes to the code
- Platform-specific compilers always give much better performance
- SPEEDUP is easily optimized on the POWER6 platform, with superior performance
- Good performance and scalability on the Intel platform
- For Cell: the same level of performance as POWER6, but with higher calculation times
- Future work: porting of the SPRNG library to the SPEs and implementation of platform-specific instructions (vectorization) for each tested platform