A Compiler-Based Tool for Array Analysis in HPC Applications Presenter: Ahmad Qawasmeh Advisor: Dr. Barbara Chapman 2013 PhD Showcase Event.

Presentation transcript:

Outline
1. Motivation
2. Related Work
3. Array Analysis Techniques
4. Array Analysis Module in OpenUH
5. Our Integrated System
6. Dragon Tool
7. Conclusion
8. Future Work

Motivation
A. Identify and fix inefficiencies in defining arrays
B. Reduce data movement
C. Identify auto-parallelization opportunities
D. Enhance code analysis

Parallelization / Reduce Data Movement
[Figure: a host (host cores, main memory holding application data) and a GPU (GPU cores, GPU memory holding application data), with the array section A[lb:ub] shown between them.]
!$acc region copyin(A(1:100,1:100))
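As a rough illustration of why copying an array section can matter, the sketch below compares the bytes moved for the section named in the directive, A(1:100,1:100), with the bytes moved for a whole array. The full array shape (1000 x 1000) and the 8-byte element size are assumptions made for this example; the slide does not state them. The helper `section_bytes` is hypothetical and not part of any compiler or of the tool.

```python
def section_bytes(bounds, elem_size):
    """Bytes needed to copy an array section, given inclusive (lb, ub) bounds per dimension."""
    count = 1
    for lb, ub in bounds:
        count *= ub - lb + 1
    return count * elem_size


# Assumed declaration (not from the slide): A is 1000 x 1000 with 8-byte elements.
whole = section_bytes([(1, 1000), (1, 1000)], elem_size=8)

# Section named in the slide's directive: A(1:100,1:100).
section = section_bytes([(1, 100), (1, 100)], elem_size=8)

# Fraction of the transfer avoided by copying only the section.
saved_fraction = 1 - section / whole

print(whole, section, saved_fraction)
```

Under these assumed sizes the section is one hundredth of the full array, so almost all of the host-to-GPU traffic for A would be avoided.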

Access Density / Array Region

start
  Declare char A[20]
  for i = 0 to 19: A[i] = …      (DEF)
  ……….
  for i = 0 to 10: … = A[i]      (USE)
  for i = 10 to 15: … = A[i]     (USE)
  ……….
  for i = 10 to 15: … = A[i]     (USE)
  ……….
  for i = 15 to 17: … = A[i]     (USE)
end

A is used 4 times at different positions. [Figure labels: Access Density, Region.]
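The loops on this slide can be summarized with a small sketch. It records each loop as an access kind plus inclusive bounds, then derives the bounding region of the USE accesses and a per-element use count. The slide does not define access density precisely, so the per-element count here is only one plausible reading, and the helper names are hypothetical.

```python
from collections import Counter

# Loops from the slide as (kind, lower bound, upper bound), bounds inclusive.
accesses = [
    ("DEF", 0, 19),
    ("USE", 0, 10),
    ("USE", 10, 15),
    ("USE", 10, 15),
    ("USE", 15, 17),
]


def bounding_region(accesses, kind):
    """Smallest [lb, ub] interval covering every access of the given kind."""
    bounds = [(lb, ub) for k, lb, ub in accesses if k == kind]
    return min(lb for lb, _ in bounds), max(ub for _, ub in bounds)


def element_counts(accesses, kind):
    """Number of loops of the given kind that touch each element index."""
    counts = Counter()
    for k, lb, ub in accesses:
        if k == kind:
            counts.update(range(lb, ub + 1))
    return counts


use_region = bounding_region(accesses, "USE")
use_counts = element_counts(accesses, "USE")
use_loops = sum(1 for k, _, _ in accesses if k == "USE")

print(use_region, use_loops, use_counts[10], use_counts[18])
```

In this reading, all 20 elements of A are defined, but only A[0] through A[17] are ever used, and the elements around index 10 to 15 are touched by more loops than the rest.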

Related Work
A. The PGI accelerator compiler applies array region analysis to reduce memory transfers.
B. The Par4All compiler tackles data transfer management between host and accelerator using array region analysis.
C. CAPO depends on interprocedural data dependence information to insert compiler directives that facilitate parallelism.
D. The HPM toolkit, PAPI, and OProfile provide facilities to instrument programs, record HWC data, and analyze results.
E. Dragon was previously developed with some limitations.
F. Array regrouping was targeted.

Array Access Analysis Techniques
A. What is array region analysis?
B. Its importance for optimizations in a parallel compiler
C. It is usually impractical to simply list the elements referenced

Array Access Analysis Techniques
Methods in terms of efficiency and precision:
Triplet-based (RS)
Linear-based (Region)
Reference-based (Atom)
Classic
[Figure: the methods placed along Precision and Efficiency axes.]
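To make the triplet-based representation concrete, here is a minimal sketch that assumes "RS" refers to regular sections written as lb:ub:stride per dimension. The `Triplet` class and the `hull` function are hypothetical names for illustration and are not the data structures used in OpenUH. The sketch shows the usual trade-off: merging two triplets into one stays cheap but may over-approximate the set of elements actually accessed.

```python
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class Triplet:
    """One array dimension described as lb:ub:stride, bounds inclusive."""
    lb: int
    ub: int
    stride: int = 1

    def elements(self):
        return range(self.lb, self.ub + 1, self.stride)

    def size(self):
        return len(self.elements())


def hull(a, b):
    """Conservative union: one triplet that covers every element of a and b."""
    # The stride must divide both strides and the offset between the lower bounds.
    stride = gcd(gcd(a.stride, b.stride), abs(a.lb - b.lb))
    return Triplet(min(a.lb, b.lb), max(a.ub, b.ub), stride)


a = Triplet(0, 10, 2)   # 0, 2, ..., 10
b = Triplet(4, 20, 4)   # 4, 8, ..., 20
h = hull(a, b)

exact = set(a.elements()) | set(b.elements())
extra = set(h.elements()) - exact

print(h, len(exact), h.size(), sorted(extra))
```

Here the merged triplet describes 11 elements while only 9 are really accessed, which illustrates how a compact summary gives up some precision in exchange for efficiency.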

Our Integrated System
[Figure: system diagram with the components HPC Application, OpenUH, IPA Phase Extension, ARA Module, HL-Whirl-Tree, Lowering, .rgn file, Dragon, and Array Analysis Graph.]

Dragon Array Analysis Graph

Dragon Call Graph for NAS LU Benchmark

Dragon Array Graph for NAS LU Benchmark

Conclusion
A. We present an interactive tool for finding the hotspot portions of interprocedural arrays in HPC applications.
B. We show that this information can be critical for better parallelization and for better cache and memory utilization.
C. The tool helps reduce data transfers by exploiting the sub-array offloading functionality supported by directive-based GPU programming models.
D. Our tool has been tested on some HPC benchmarks.

Future Work
A. Combine the Array Analysis and Data Dependency modules in OpenUH to enhance memory and cache utilization.
B. Extend our array analysis tool to support the analysis and visualization of remote array accesses in a PGAS context.
C. Enrich our tool's features by supporting high-performance 3D visualization via the Qt OpenGL module.

Bibliography
[1] P. Group. (2008) PGI compilers, GPUs and you! pgi presentation sc08.pdf. [Online]. Available:
[2] M. Amini, F. Coelho, F. Irigoin, and R. Keryell, “Static compilation analysis for host-accelerator communication optimization,” in The 24th International Workshop on Languages and Compilers for Parallel Computing, Fort Collins, Colorado, Sep
[3] (2001) Code parallelization with CAPO – a user manual. [Online]. Available:
[4] (2008) Hardware performance monitor (HPM) toolkit users guide. [Online]. Available: ug.pdf
[5] P. J. Mucci, S. Browne, C. Deane, and G. Ho. (1999, Sep.) PAPI: A portable interface to hardware performance counters. dodugc99-papi.pdf. [Online]. Available: mucci/latest/pubs/
[6] W. E. Cohen. (2004) Tuning programs with OProfile. Oprofile.pdf. [Online]. Available:
[7] O. Hernandez, C. Liao, and B. Chapman, “Dragon: A static and dynamic tool for OpenMP,” in Workshop on OpenMP Applications and Tools (WOMPAT 2004), 2005, pp. 53–66.
[8] A. Qawasmeh, B. Chapman, and A. Banerjee, “A Compiler-Based Tool for Array Analysis in HPC Applications,” in Proceedings of the 41st International Conference on Parallel Computing Workshops, Pittsburgh, PA, USA, Sep. 2012, pp. 454–463.
[9] X. Shen, Y. Gao, C. Ding, and R. Archambault, “Lightweight reference affinity analysis,” in Proceedings of the 19th ACM International Conference on Supercomputing, Boston, MA, USA, Jun. 2005, pp. 131–140.
[10] (2012) High Performance Computing and Tools Research Group. [Online]. Available: