A Coarray Fortran Implementation to Support Data-Intensive Application Development. Deepak Eachempati, Alan Richardson, Terrence Liao, Henri Calandra, Barbara Chapman.

Presentation transcript:

A Coarray Fortran Implementation to Support Data-Intensive Application Development
Deepak Eachempati (1), Alan Richardson (2), Terrence Liao (3), Henri Calandra (3), Barbara Chapman (1)
Data-Intensive Scalable Computing Systems 2012 (DISCS'12) Workshop, November 16, 2012
(1) Department of Computer Science, University of Houston
(2) Department of Earth, Atmospheric, and Planetary Sciences, MIT
(3) Total E&P

Oil and Gas Industry: Compute Needs
Industry is looking for faster and more cost-effective ways to process massive amounts of data:
more powerful hardware
more productive programming models
innovative software techniques

Outline
Fortran 2008 parallel processing additions (CAF)
CAF Implementation in OpenUH Fortran compiler
Application port to CAF and Results
Further extensions for Parallel I/O
Closing Remarks

Coarray Model in Fortran 2008
Derives from Co-Array Fortran (CAF)
SPMD execution model, PGAS memory model
  – execution entities called images
  – coarrays: globally accessible, symmetric data objects
Additional intrinsic subroutines/functions for querying process and data information
Additional statements in the language for synchronization

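Working with Distributed Data using Coarrays
[Figure: grid of images, with M along the first codimension and 1, 2, 3, 4, * along the second]
  real :: B[M, *]
  B        references local B
  B[3,4]   references local B (when executed on the image with cosubscripts [3,4])
  B[3,3]   references B in left neighbor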

Working with Distributed Data using Coarrays
[Figure: grid of images, with M along the first codimension and 1, 2, 3, 4, * along the second]
  real :: B(10,10)[M, *]
  B(2:4,2:4)        references local subarray of B
  B(2:4,2:4)[3,4]   references local subarray of B (when executed on the image with cosubscripts [3,4])
  B(2:4,2:4)[3,3]   references subarray of B in left neighbor
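
How cosubscripts such as [3,4] select an image can be modeled outside Fortran. The sketch below is plain Python, not CAF code. The name `image_index` mirrors the Fortran intrinsic, but the function itself and the value M = 4 are illustrative assumptions. It uses the column-major ordering of cosubscripts with lower cobounds of 1, so for `B[M,*]` the image at `[i,j]` has index `i + (j-1)*M`.

```python
def image_index(cosubscripts, coextents):
    """Map 1-based cosubscripts to a 1-based image index (column-major).

    coextents holds the extent of every codimension except the last,
    which is '*' in the declaration and therefore has no fixed extent.
    """
    if len(cosubscripts) != len(coextents) + 1:
        raise ValueError("need one cosubscript per codimension")
    index, stride = 1, 1
    for k, sub in enumerate(cosubscripts):
        if sub < 1 or (k < len(coextents) and sub > coextents[k]):
            raise ValueError("cosubscript out of range")
        index += (sub - 1) * stride
        if k < len(coextents):
            stride *= coextents[k]
    return index


M = 4  # illustrative first coextent for B[M,*]
here = image_index((3, 4), (M,))  # image that owns B[3,4]
left = image_index((3, 3), (M,))  # its left neighbor, B[3,3]
print(here, left)  # prints: 15 11
```

The two indices differ by M, which is why decrementing the second cosubscript reaches the neighboring column of images.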

2D Halo Exchange Example with CAF
  real :: a(0:R+1, 0:C+1)[pR,*]
  …
  a(R+1,1:C)[top(1),top(2)]     = a(1,1:C)
  a(0,1:C)[bottom(1),bottom(2)] = a(R,1:C)
  a(1:R,0)[right(1),right(2)]   = a(1:R,C)
  a(1:R,C+1)[left(1),left(2)]   = a(1:R,1)
  sync all
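
The effect of the four one-sided puts can be checked with a small plain-Python model (not CAF code). Each "image" is a dictionary entry holding an (R+2) x (C+2) tile, and the first put is written with the full row section `a(R+1,1:C)`. The helper names, the periodic neighbor layout, and the use of image coordinates as cell values are all assumptions made for illustration.

```python
def make_images(pR, pC, R, C):
    """One (R+2) x (C+2) tile per image; interior cells hold the image's coordinates."""
    images = {}
    for r in range(pR):
        for c in range(pC):
            tile = [[None] * (C + 2) for _ in range(R + 2)]
            for i in range(1, R + 1):
                for j in range(1, C + 1):
                    tile[i][j] = (r, c)
            images[(r, c)] = tile
    return images


def halo_exchange(images, pR, pC, R, C):
    """Apply the four puts from the slide on every image.

    Puts only read interior cells and only write halo cells, so the
    order in which images are visited does not change the result.
    """
    for (r, c), a in images.items():
        top = images[((r - 1) % pR, c)]      # assumed periodic neighbors
        bottom = images[((r + 1) % pR, c)]
        left = images[(r, (c - 1) % pC)]
        right = images[(r, (c + 1) % pC)]
        for j in range(1, C + 1):
            top[R + 1][j] = a[1][j]          # a(R+1,1:C)[top]  = a(1,1:C)
            bottom[0][j] = a[R][j]           # a(0,1:C)[bottom] = a(R,1:C)
        for i in range(1, R + 1):
            right[i][0] = a[i][C]            # a(1:R,0)[right]  = a(1:R,C)
            left[i][C + 1] = a[i][1]         # a(1:R,C+1)[left] = a(1:R,1)
    # In CAF, `sync all` at this point ensures every image has finished its puts.


images = make_images(pR=2, pC=3, R=2, C=2)
halo_exchange(images, 2, 3, 2, 2)
```

After the exchange, each halo row or column of a tile holds the coordinates of the neighbor that wrote it, while the four corner cells stay untouched because this exchange covers faces only.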

2D Halo Exchange with MPI
  real :: a(0:R+1, 0:C+1)
  …
  call mpi_isend( a(1,1:C), C, mpi_real, &
                  top(myp), TAG, ...)
  call mpi_irecv( a(R+1,1:C), C, mpi_real, &
                  bottom(myp), TAG, ...)
  call mpi_isend( a(R,1:C), C, mpi_real, &
                  bottom(myp), TAG, ...)
  call mpi_irecv( a(0,1:C), C, mpi_real, &
                  top(myp), TAG, ...)
  call mpi_isend( a(1:R,C), R, mpi_real, &
                  right(myp), TAG, ...)
  call mpi_irecv( a(1:R,0), R, mpi_real, &
                  left(myp), TAG, ...)
  call mpi_isend( a(1:R,1), R, mpi_real, &
                  left(myp), TAG, ...)
  call mpi_irecv( a(1:R,C+1), R, mpi_real, &
                  right(myp), TAG, ...)
  call mpi_waitall( 8, ...)

Implementation of CAF
OpenUH compiler
  – an industry-quality, optimizing compiler based on Open64
  – features: dependence and data-flow analysis, interprocedural analysis, OpenMP
  – backend supports multiple targets (x86_64, IA64, IA32, MIPS, PTX)
[Figure: OpenUH compiler pipeline. Stages shown: CAF source code, Fortran front-end with coarray support, coarray translation phase, loop optimizer, global optimizer, code generation, executable. The OpenUH CAF runtime library is shown alongside the compiler.]

Runtime Support for CAF
Runtime Interface (libcaf)
  – 1-sided Communication
  – PGAS Memory Allocation
  – Synchronization
  – Collectives Support (e.g. reductions)
  – Atomics
Portable Communication Substrate: GASNet or ARMCI

Comparison with other Implementations

Compiler        Commercial/Free                        Fortran 2008 Coarray Support?
OpenUH          Free                                   Yes
G95             Partially free, no longer supported    Missing locks support
Gfortran        Free                                   In progress
Rice CAF 2.0    Free                                   Partially, but adds different features
Cray Fortran    Commercial                             Yes
Intel Fortran   Commercial                             Yes

Seismic Subsurface Imaging: Reverse Time Migration
A source wave is emitted per shot
Reflected waves are captured by an array of sensors
RTM (in the time domain) uses the finite difference method to numerically solve the wave equation and reconstruct the subsurface image (in parallel, with domain decomposition)

RTM Implementations
Isotropic
  – simplest model
  – assumes reflected waves propagate at same speed in every direction from a point
  – only swaps faces (8 swaps in halo exchange)
Tilted Transverse Isotropy (TTI)
  – assumes waves may propagate at different speeds
  – swaps faces and edges (18 swaps in halo exchange)

Typical Data Usage
Generally several thousand shots
  – data-parallel problem, where each shot can be processed independently in parallel
  – each shot handles several GB of data
  – so the total data to analyze is in the terabyte range
Handling I/O
  – C I/O reads in velocity and coefficient models
  – shot headers are read by the master and distributed
  – each processor writes to a distinct file, and the files are merged in a post-processing step

Results for CAF RTM port: Forward Shot
Total Domain Size: 1024 x 768 x 512 (3.0 GB per shot)
Isotropic case: up to 32% faster compared to corresponding MPI implementation
TTI case: competitive performance with MPI
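
The per-shot size can be sanity-checked with a line of arithmetic. The slides do not say how the 3.0 GB breaks down. The sketch below assumes that "GB" means 2^30 bytes and that 8 bytes are stored per grid point (for example one double-precision field or two single-precision fields); both are assumptions, not statements from the talk.

```python
nx, ny, nz = 1024, 768, 512    # total domain size from the slide
points = nx * ny * nz          # number of grid points

bytes_per_point = 8            # assumption: e.g. one 8-byte value per point
GIB = 2 ** 30                  # assumption: "GB" on the slide means 2**30 bytes

size_gib = points * bytes_per_point / GIB
print(points, size_gib)        # prints: 402653184 3.0
```

Under these assumptions the domain comes to exactly 3.0 GB, which matches the figure quoted on the slide.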

Results for CAF RTM port: Backward Shot
Total Domain Size: 1024 x 768 x 512 (3.0 GB per shot)
Isotropic case: performance hit at 256 procs
TTI case: lagging a bit behind MPI

Extending Fortran for Parallel I/O
We are currently designing a prototype implementation for a parallel I/O language extension
Fortran I/O has not yet been extended to facilitate cooperative I/O to shared files
  – the original Co-Array Fortran specified a simple extension to Fortran I/O
  – parallel I/O may be added in a future version of the standard

Fortran I/O
Fortran provides interfaces for formatted and unformatted I/O
  open( 10, file='fn', action='write', &
        access='direct', recl=k )
  …
  write (10, rec=3) A
[Figure: file 'fn' connected to unit 10 as a sequence of records (record 1, record 2, record 3, record 4, …); the write stores A in record 3]
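
What `rec=3` means for a direct-access file can be illustrated with fixed-length records in plain Python. This is a model, not Fortran I/O: it assumes `recl` is measured in bytes, that records carry no markers, and that short records are zero-padded. In Fortran the unit of `recl` and the padding are processor-dependent. The helper names are hypothetical.

```python
import io


def write_record(f, rec, payload, recl):
    """Write payload as 1-based record number rec of a fixed-length-record file."""
    if rec < 1:
        raise ValueError("record numbers start at 1")
    if len(payload) > recl:
        raise ValueError("payload longer than record length")
    f.seek((rec - 1) * recl)               # record rec starts at byte (rec-1)*recl
    f.write(payload.ljust(recl, b"\0"))    # pad to the full record length


def read_record(f, rec, recl):
    """Read 1-based record number rec."""
    f.seek((rec - 1) * recl)
    return f.read(recl)


k = 8                      # record length in bytes (assumed unit)
f = io.BytesIO()           # in-memory stand-in for file 'fn' on unit 10
write_record(f, 3, b"AAAA", k)
print(len(f.getvalue()), read_record(f, 3, k))
```

Because every record has the same length, any record can be located without reading the ones before it.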

Current limitations of I/O
Issues:
1. no defined, legal way for multiple images to access the same file
2. a file is a 1-dimensional sequence of records
3. records are read/written one at a time
4. no mechanism for collective accesses to a shared file amongst multiple images

Proposed Extension for Parallel I/O
Allow a file to be "share-opened", e.g. OPEN( 10, file='fn', TEAM='yes', …)
  – all images form a team with shared access to the same file
  – implicit synchronization recommended only for direct access mode
FLUSH statement used to ensure changes by one image are visible to other images in team
CLOSE statement has implicit image synchronization

Further extensions we're exploring
Multi-dimensional view of records
Read/write multiple records at a time
Collective read/write operations on shared files
  open( 10, file='fn', action='write', &
        access='direct', ndim=2, &
        dims=(/M/), team='yes', recl=k )
  …
[Figure: file 'fn' connected to unit 10, viewed as a 2-D grid of records with indices (1,1), (1,2), (1,3), … through (M,1), (M,2), (M,3), …]

Further extensions we're exploring
Multi-dimensional view of records
Read/write multiple records at a time
Collective read/write operations on shared files
  write (10, rec_lb=(/ 2,2 /), rec_ub=(/ 4,3 /) ) &
        A(1:4, 1:2)
[Figure: file 'fn' connected to unit 10, viewed as a 2-D grid of records (1,1) through (M,3), …; the write places A(1:4,1:2) into a highlighted block of records in rows 2 to 4]
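
One way to read `rec_lb`/`rec_ub` is as the corners of a box in the 2-D record grid. The slides do not state how 2-D record indices map to positions in the file, so the sketch below assumes a column-major layout with first extent M, by analogy with Fortran arrays and `dims=(/M/)`. The function names and M = 6 are illustrative assumptions.

```python
def record_number(i, j, M):
    """Linear 1-based record number of 2-D record (i, j), assuming column-major order."""
    return i + (j - 1) * M


def records_in_box(rec_lb, rec_ub, M):
    """Linear record numbers covered by the box rec_lb..rec_ub (inclusive)."""
    (i0, j0), (i1, j1) = rec_lb, rec_ub
    return [record_number(i, j, M)
            for j in range(j0, j1 + 1)
            for i in range(i0, i1 + 1)]


M = 6  # illustrative first extent of the record grid
print(records_in_box((2, 2), (4, 3), M))  # prints: [8, 9, 10, 14, 15, 16]
```

Under this assumed layout a box is a set of short contiguous runs, one per column, which is the access pattern a runtime would have to turn into strided file operations.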

Further extensions we're exploring
Multi-dimensional view of records
Read/write multiple records at a time
Collective read/write operations on shared files
  type(T) :: A(2,2)[3,*]
  …
  my_rec_lbs = get_rec_lbs( this_image() )
  my_rec_ubs = get_rec_ubs( this_image() )
  write_team( 10, rec_lb=my_rec_lbs, &
              rec_ub=my_rec_ubs) &
        A(:,:)
[Figure: file 'fn' connected to unit 10, viewed as a 6 x 4 grid of records (1,1) through (6,4); write_team places A(1:2,1:2) from images [1,1], [2,1], [3,1], [1,2], [2,2] and [3,2] into 2 x 2 blocks of records]
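
The slide leaves `get_rec_lbs` and `get_rec_ubs` undefined. One plausible block mapping, sketched in plain Python, places the 2 x 2 tile of the image with cosubscripts [p,q] at record rows 2p-1 to 2p and record columns 2q-1 to 2q. This layout is an assumption inferred from the 6 x 4 record grid and the `[3,*]` codimensions, and the function names are hypothetical.

```python
def cosubscripts(image, first_coextent):
    """1-based image index -> (p, q) for a coarray declared [first_coextent, *]."""
    p = (image - 1) % first_coextent + 1
    q = (image - 1) // first_coextent + 1
    return p, q


def get_rec_bounds(image, first_coextent=3, block=(2, 2)):
    """Lower and upper 2-D record bounds of the tile owned by image (assumed block layout)."""
    p, q = cosubscripts(image, first_coextent)
    lbs = ((p - 1) * block[0] + 1, (q - 1) * block[1] + 1)
    ubs = (p * block[0], q * block[1])
    return lbs, ubs


for image in range(1, 7):
    print(image, cosubscripts(image, 3), get_rec_bounds(image))
```

With six images the tiles cover records (1,1) through (6,4) without overlap, so a collective write of this shape needs no coordination between images beyond agreeing on the mapping.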

Leverage Global Arrays as memory buffers for I/O
Implementation in progress which utilizes global arrays (GA) as I/O buffers in memory
[Figure: compute nodes issue I/O requests to the in-memory buffers; asynchronous disk updates go to the I/O nodes]

In Summary
The Fortran coarray model may be used for processing large data sets
We developed an implementation that is freely available and used it to develop an RTM application
Fortran's I/O model doesn't support parallel I/O for large-scale, multi-dimensional array data sets, and we are working on addressing this

Thanks