
Slide 1: Toward Mega-Scale Computing with pMatlab
Chansup Byun and Jeremy Kepner, MIT Lincoln Laboratory; Vipin Sachdeva and Kirk E. Jordan, IBM T.J. Watson Research Center
HPEC 2010
This work is sponsored by the Department of the Air Force under Air Force contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Slide 2: Outline
Introduction
– What is Parallel Matlab (pMatlab)
– IBM Blue Gene/P System
– BG/P Application Paths
– Porting pMatlab to BG/P
Performance Studies
Optimization for Large Scale Computation
Summary

Slide 3: Parallel Matlab (pMatlab)
[Diagram: layered architecture with the application (input, analysis, output) on top of the library layer (pMatlab: vector/matrix, computation, task, conduit) and the kernel layer (math in MATLAB/Octave, messaging in MatlabMPI), running on the parallel hardware.]
Layered architecture for parallel computing
– Kernel layer does single-node math & parallel messaging
– Library layer provides a parallel data and computation toolbox to Matlab users
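A minimal sketch of how the library layer is used from ordinary Matlab code; the map/local/put_local/agg calls follow the published pMatlab API, but the array size, process grid, and the FFT step are illustrative assumptions.

  % Minimal pMatlab-style example (assumes pMatlab is on the path and the job was
  % launched with pRUN, so Np and Pid are already defined).
  PARALLEL = 1;                        % set to 0 to fall back to serial Matlab for debugging
  Xmap = 1;  if PARALLEL, Xmap = map([Np 1], {}, 0:Np-1); end
  X = zeros(2^10, 2^10, Xmap);         % distributed matrix, rows split across processes
  Xloc = local(X);                     % each process works on its local block
  Xloc = fft(Xloc, [], 2);             % single-node math (kernel layer) on the local part
  X = put_local(X, Xloc);              % write the result back into the distributed array
  Y = agg(X);                          % leader process gathers the global result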

Slide 4: IBM Blue Gene/P System
Core speed: 850 MHz
LLGrid core count: ~1K
Blue Gene/P core count: ~300K

Slide 5: Blue Gene Application Paths
[Diagram: two paths into the Blue Gene environment: serial and pleasantly parallel apps use High Throughput Computing (HTC); highly scalable message passing apps use High Performance Computing (MPI).]
High Throughput Computing (HTC)
– Enables a BG partition for many single-node jobs
– Ideal for "pleasantly parallel" type applications

Slide 6: HTC Node Modes on BG/P
Symmetrical Multiprocessing (SMP) mode
– One process per compute node
– Full node memory available to the process
Dual mode
– Two processes per compute node
– Half of the node memory per process
Virtual Node (VN) mode
– Four processes per compute node (one per core)
– One quarter of the node memory per process

Slide 7: Porting pMatlab to BG/P System
Requesting and booting a BG partition in HTC mode
– Execute the "qsub" command, defining the number of processes, runtime, and HTC boot script:
    htcpartition --trace 7 --boot --mode dual \
        --partition $COBALT_PARTNAME
– Wait for the partition to be ready (until the boot completes)
Running jobs
– Create and execute a Unix shell script that runs a series of "submit" commands, including:
    submit -mode dual -pool ANL-R00-M1-512 \
        -cwd /path/to/working/dir -exe /path/to/octave \
        -env LD_LIBRARY_PATH=/home/cbyun/lib \
        -args "--traditional MatMPI/MatMPIdefs523.m"
Combine the two steps:
    eval(pRUN('m_file', Nprocs, 'bluegene-smp'))
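For example, a run might then be launched from the Octave prompt as below; pRUN and the 'bluegene-smp' configuration name come from the slide above, while the benchmark name and process count are illustrative assumptions.

  % Launch the pStream benchmark on 64 processes using the BG/P SMP configuration.
  % As described above, this single call combines the partition-boot and submit steps.
  Nprocs = 64;
  eval(pRUN('pStream', Nprocs, 'bluegene-smp'));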

Slide 8: Outline
Introduction
Performance Studies
– Single Process Performance
– Point-to-Point Communication
– Scalability
Optimization for Large Scale Computation
Summary

Slide 9: Performance Studies
Single Processor Performance
– MandelBrot
– ZoomImage
– Beamformer
– Blurimage
– Fast Fourier Transform (FFT)
– High Performance LINPACK (HPL)
Point-to-Point Communication
– pSpeed
Scalability
– Parallel Stream Benchmark: pStream

Slide 10: Single Process Performance: Intel Xeon vs. IBM PowerPC 450
[Chart: runtimes for the HPL, MandelBrot, ZoomImage*, Beamformer, Blurimage*, and FFT benchmarks (reported values include 26.4 s, 11.9 s, 18.8 s, 6.1 s, 1.5 s, and 5.2 s); lower is better.]
* The conv2() performance issue in Octave has been improved in a subsequent release.

Slide 11: Octave Performance with Optimized BLAS
[Chart: DGEMM performance comparison, MFLOPS vs. matrix size (N x N).]

Slide 12: Single Process Performance: Stream Benchmark
STREAM kernels: Triad (a = b + q*c), Add (c = a + b), Scale (b = q*c)
[Chart: measured bandwidths of 944 MB/s, 1208 MB/s, and 996 MB/s, and performance relative to Matlab; higher is better.]
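As a point of reference, a minimal Matlab/Octave sketch of the three STREAM kernels is shown below; the vector length and the scalar q are illustrative assumptions, not the values used for the measurements on the slide.

  % Minimal STREAM-style kernels in Matlab/Octave (illustrative vector length).
  n = 2^25;  q = 3.0;
  a = rand(n,1);  b = rand(n,1);  c = rand(n,1);
  bytes = 8 * n;                      % bytes per vector (double precision)
  tic;  b = q * c;       tScale = toc;   % Scale: 2 vectors of traffic
  tic;  c = a + b;       tAdd   = toc;   % Add:   3 vectors of traffic
  tic;  a = b + q * c;   tTriad = toc;   % Triad: 3 vectors of traffic
  fprintf('Scale %.0f MB/s  Add %.0f MB/s  Triad %.0f MB/s\n', ...
          2*bytes/tScale/1e6, 3*bytes/tAdd/1e6, 3*bytes/tTriad/1e6);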

Slide 13: Point-to-Point Communication
[Diagram: Np processes (Pid = 0, 1, 2, 3, ..., Np-1) arranged in a ring, each exchanging messages with its neighbor.]
pMatlab example: pSpeed
– Send/receive messages to/from the neighbor
– Messages are files in pMatlab
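A hedged sketch of the kind of neighbor exchange pSpeed times, written against the MatlabMPI send/receive calls that underlie pMatlab; the message size, tag, and ring pattern here are illustrative assumptions rather than the actual pSpeed source.

  % Neighbor exchange along a ring using MatlabMPI-style calls (illustrative sketch).
  MPI_Init;
  comm = MPI_COMM_WORLD;
  Np   = MPI_Comm_size(comm);
  Pid  = MPI_Comm_rank(comm);
  dest = mod(Pid + 1, Np);            % right-hand neighbor
  src  = mod(Pid - 1 + Np, Np);       % left-hand neighbor
  tag  = 1;
  msg  = rand(2^10, 1);               % message payload (size is an assumption)
  MPI_Send(dest, tag, comm, msg);     % in MatlabMPI the message is written to a file
  recvd = MPI_Recv(src, tag, comm);   % and read back from the neighbor's file
  MPI_Finalize;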

Slide 14: Filesystem Consideration
Mode S: a single NFS-shared disk holds all message files
Mode M: a group of cross-mounted, NFS-shared disks distributes the message files
[Diagram: the same ring of processes (Pid = 0 ... Np-1) communicating through one shared disk vs. through multiple cross-mounted disks.]

Slide 15: pSpeed Performance on LLGrid: Mode S
[Chart: pSpeed performance in Mode S; higher is better.]

Slide 16: pSpeed Performance on LLGrid: Mode M
[Charts: pSpeed performance in Mode M; one panel labeled "lower is better", one labeled "higher is better".]

Slide 17: pSpeed Performance on BG/P
BG/P filesystem: GPFS
[Chart: pSpeed performance on BG/P with GPFS.]

Slide 18: pStream Results with Scaled Size
SMP mode: initial global array size of 2^25 for Np = 1
– Global array size scales proportionally as the number of processes increases (1024x1)
– 563 GB/sec, 100% efficiency at Np = 1024
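A short sketch of the weak-scaling setup implied above: the per-process array size stays fixed while the global size grows with Np. This is a minimal illustration using pMatlab-style maps; only the 2^25 starting size comes from the slide, the rest is assumed.

  % Weak-scaled pStream-style setup: the global size grows with Np so each process
  % keeps a constant 2^25-element local vector (map/zeros/local per the pMatlab API).
  nLocal  = 2^25;                     % elements per process (Np = 1 case from the slide)
  nGlobal = nLocal * Np;              % global array size scales with process count
  Amap = map([Np 1], {}, 0:Np-1);     % 1-D block distribution over all processes
  A = zeros(nGlobal, 1, Amap);
  B = zeros(nGlobal, 1, Amap);
  C = zeros(nGlobal, 1, Amap);
  a = local(A);  b = local(B);  c = local(C);
  tic;  a = b + 3.0*c;  tTriad = toc; % each process times its local triad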

Slide 19: pStream Results with Fixed Size
Global array size of 2^30
– The number of processes scaled up to 16,384 (4096x4)
– Aggregate bandwidths reach the TB/sec range:
– VN mode: 101% efficiency at Np = 16384
– Dual mode: 96% efficiency at Np = 8192
– SMP mode: 98% efficiency at Np = 4096

Slide 20: Outline
Introduction
Performance Studies
Optimization for Large Scale Computation
– Aggregation
Summary

Slide 21: Current Aggregation Architecture
The leader process receives all of the distributed data; every other process sends its portion of the distributed data to the leader.
This process is inherently sequential:
– The leader receives Np-1 messages
[Diagram: example with Np = 8; Np is the total number of processes.]
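A hedged sketch of this sequential gather, reusing the MatlabMPI-style calls and the comm/Pid/Np variables from the earlier sketch; the tag and the assumption that local blocks simply concatenate are illustrative, whereas the real agg() handles general distributed-array maps.

  % Sequential aggregation: the leader (Pid = 0) receives Np-1 messages one at a time.
  tag = 100;                          % message tag (assumption)
  if Pid == 0
    Yglobal = local(X);               % leader starts with its own local block
    for p = 1:Np-1
      Yglobal = [Yglobal; MPI_Recv(p, tag, comm)];   % one receive per remote process
    end
  else
    MPI_Send(0, tag, comm, local(X)); % everyone else sends its block to the leader
  end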

Slide 22: Binary-Tree Based Aggregation
BAGG: distributed message collection using a binary tree
– The even-numbered processes send a message to their odd-numbered neighbor
– The odd-numbered processes receive a message from their even-numbered neighbor
The maximum number of messages a process may send/receive is N, where Np = 2^N
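A minimal sketch of one way such a binary-tree gather can be organized, assuming Np is a power of two and reusing the MatlabMPI-style calls from above. For simplicity the tree here is rooted at Pid 0, so the sender/receiver roles at each level may be the reverse of BAGG's exact convention; this is an illustration, not the BAGG() source.

  % Binary-tree gather rooted at Pid 0, assuming Np = 2^N (illustrative sketch).
  tag  = 200;                         % message tag (assumption)
  data = local(X);                    % start with this process's local block
  for level = 0:log2(Np)-1
    step = 2^level;
    if mod(Pid, 2*step) == 0
      if Pid + step < Np
        data = [data; MPI_Recv(Pid + step, tag, comm)];  % absorb partner's data
      end
    elseif mod(Pid, 2*step) == step
      MPI_Send(Pid - step, tag, comm, data);             % pass accumulated data down
      break;                                             % this process is done
    end
  end
  % Pid 0 ends up with the aggregated data; each process sends/receives at most N messages.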

Slide 23: BAGG() Performance
Two-dimensional data and process distribution
Two different file systems are used for the performance comparison:
– IBRIX: file system for users' home directories (10x faster at Np = 128)
– LUSTRE: parallel file system for all computation (8x faster at Np = 128)

Slide 24: BAGG() Performance, 2
Four-dimensional data and process distribution
With the GPFS file system on the IBM Blue Gene/P system (ANL's Surveyor)
– From 8 processes to 1024 processes
– 2.5x faster at Np = 1024

Slide 25: Generalizing Binary-Tree Based Aggregation
HAGG: extend the binary tree to the next power of two
– Example: suppose Np = 6; the next power of two is Np* = 8
– Skip any messages from/to the fictitious Pids
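A small sketch of the padding idea, continuing the gather sketch above (same data, tag, and comm variables); only the Np = 6 to Np* = 8 example comes from the slide, and the bookkeeping details are illustrative assumptions rather than the HAGG() source.

  % Pad the process count to the next power of two and skip the fictitious ranks.
  NpStar = 2^ceil(log2(Np));          % e.g. Np = 6 -> NpStar = 8
  for level = 0:log2(NpStar)-1
    step    = 2^level;
    partner = Pid + step;             % would-be sender at this level
    if mod(Pid, 2*step) == 0
      if partner < Np                 % partner >= Np is fictitious: skip the receive
        data = [data; MPI_Recv(partner, tag, comm)];
      end
    elseif mod(Pid, 2*step) == step
      MPI_Send(Pid - step, tag, comm, data);
      break;
    end
  end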

Slide 26: BAGG() vs. HAGG()
HAGG() generalizes BAGG()
– Removes the restriction (Np = 2^N) in BAGG()
– Additional costs associated with bookkeeping
Performance comparison on two-dimensional data and process distribution
– ~3x faster at Np = 128

Slide 27: BAGG() vs. HAGG(), 2
Performance comparison on four-dimensional data and process distribution
The performance difference is marginal in a dedicated environment
– SMP mode on the IBM Blue Gene/P system

Slide 28: BAGG() Performance with Crossmounts
Significant performance improvement by reducing resource contention on the file system
– Performance is jittery because a production cluster was used for the performance test
[Chart: aggregation time with and without crossmounts; lower is better.]

Slide 29: Summary
pMatlab has been ported to the IBM Blue Gene/P system
Clock-normalized, single-process performance of Octave on the BG/P system is on par with Matlab
For pMatlab point-to-point communication (pSpeed), file system performance is important
– Performance is as expected with GPFS on BG/P
The Parallel Stream Benchmark scaled to 16,384 processes
Developed a new pMatlab aggregation function using a binary tree to scale beyond 1024 processes