Active Harmony and the Chapel HPC Language Ray Chen, UMD Jeff Hollingsworth, UMD Michael P. Ferguson, LTS.

Slides:



Advertisements
Similar presentations
Shared-Memory Model and Threads Intel Software College Introduction to Parallel Programming – Part 2.
Advertisements

Implementing Domain Decompositions Intel Software College Introduction to Parallel Programming – Part 3.
Chapter 17 vector and Free Store Bjarne Stroustrup
1 Multithreaded Programming in Java. 2 Agenda Introduction Thread Applications Defining Threads Java Threads and States Examples.
© 2005 by Prentice Hall Chapter 13 Finalizing Design Specifications Modern Systems Analysis and Design Fourth Edition Jeffrey A. Hoffer Joey F. George.
Copyright © 2002 Pearson Education, Inc. Slide 1.
Chapter 7 Constructors and Other Tools. Copyright © 2006 Pearson Addison-Wesley. All rights reserved. 7-2 Learning Objectives Constructors Definitions.
Chapter 11 Separate Compilation and Namespaces. Copyright © 2006 Pearson Addison-Wesley. All rights reserved Learning Objectives Separate Compilation.
Copyright © 2003 Pearson Education, Inc. Slide 6-1 Created by Cheryl M. Hughes, Harvard University Extension School Cambridge, MA The Web Wizards Guide.
Copyright © 2002 Pearson Education, Inc. Slide 1.
CSE 105 Structured Programming Language (C)
© 2010 Pearson Addison-Wesley. All rights reserved. Addison Wesley is an imprint of Chapter 5: Repetition and Loop Statements Problem Solving & Program.
0 - 0.
DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
MULT. INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
Addition Facts
1 Interprocess Communication 1. Ways of passing information 2. Guarded critical activities (e.g. updating shared data) 3. Proper sequencing in case of.
1 Automating the Generation of Mutation Tests Mike Papadakis and Nicos Malevris Department of Informatics Athens University of Economics and Business.
Prasanna Pandit R. Govindarajan
Tintu David Joy. Agenda Motivation Better Verification Through Symmetry-basic idea Structural Symmetry and Multiprocessor Systems Mur ϕ verification system.
Robust Window-based Multi-node Technology- Independent Logic Minimization Jeff L.Cobb Kanupriya Gulati Sunil P. Khatri Texas Instruments, Inc. Dept. of.
Construction process lasts until coding and testing is completed consists of design and implementation reasons for this phase –analysis model is not sufficiently.
Debugging operating systems with time-traveling virtual machines Sam King George Dunlap Peter Chen CoVirt Project, University of Michigan.
Active Harmonys Plug-in Interface A World of Possibilities Ray S. Chen Jeffrey K. Hollingsworth Paradyn/Dyninst Week 2013 April 30, 2013.
1 Automating Auto Tuning Jeffrey K. Hollingsworth University of Maryland
Managing Web server performance with AutoTune agents by Y. Diao, J. L. Hellerstein, S. Parekh, J. P. Bigu Jangwon Han Seongwon Park
QA practitioners viewpoint
Database Performance Tuning and Query Optimization
The HDF Group Parallel HDF5 Developments 1 Copyright © 2010 The HDF Group. All Rights Reserved Quincey Koziol The HDF Group
Accelerated Linear Algebra Libraries James Wynne III NCCS User Assistance.
Parallel List Ranking Advanced Algorithms & Data Structures Lecture Theme 17 Prof. Dr. Th. Ottmann Summer Semester 2006.
Chapter 24 Lists, Stacks, and Queues
Eiffel: Analysis, Design and Programming Bertrand Meyer (Nadia Polikarpova) Chair of Software Engineering.
Modern Programming Languages, 2nd ed.
INTRODUCTION TO SIMULATION WITH OMNET++ José Daniel García Sánchez ARCOS Group – University Carlos III of Madrid.
Code Correctness, Readability, Maintainability Svetlin Nakov Telerik Corporation
1 Evaluations in information retrieval. 2 Evaluations in information retrieval: summary The following gives an overview of approaches that are applied.
1 Advanced C Programming from Expert C Programming: Deep C Secrets by Peter van der Linden CIS*2450 Advanced Programming Concepts.
Processes Management.
Executional Architecture
Reaching Agreements II. 2 What utility does a deal give an agent? Given encounter  T 1,T 2  in task domain  T,{1,2},c  We define the utility of a.
DB Relay An Introduction. INSPIRATION Database access is WAY TOO HARD The crux.
C Language.
Addition 1’s to 20.
Week 1.
Lilian Blot CORE ELEMENTS SELECTION & FUNCTIONS Lecture 3 Autumn 2014 TPOP 1.
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 14: More About Classes.
Copyright © 2012 Pearson Education, Inc. Chapter 14: More About Classes.
Introduction to Recursion and Recursive Algorithms
Copyright © 2009 by SDL Tridion. SDL Tridion®, SDL Tridion R5™, BluePrinting™, SiteEdit™ and WebForms™ are trademarks of SDL Tridion Holding B.V. or its.
Chapter 8 Improving the User Interface
Davide Mottin, Senjuti Basu Roy, Alice Marascu, Yannis Velegrakis, Themis Palpanas, Gautam Das A Probabilistic Optimization Framework for the Empty-Answer.
Overview Motivations Basic static and dynamic optimization methods ADAPT Dynamo.
Telescoping Languages: A Compiler Strategy for Implementation of High-Level Domain-Specific Programming Systems Ken Kennedy Rice University.
Lecture 29 Fall 2006 Lecture 29: Parallel Programming Overview.
Parallel Programming in Java with Shared Memory Directives.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER Author : Y. Zhao 、 C. Hu 、 S. Wang 、 S. Zhang Source : Proceedings of the 2nd IASTED.
1 The Portland Group, Inc. Brent Leback HPC User Forum, Broomfield, CO September 2009.
Gedae, Inc. Gedae: Auto Coding to a Virtual Machine Authors: William I. Lundgren, Kerry B. Barnes, James W. Steed HPEC 2004.
1 University of Maryland Runtime Program Evolution Jeff Hollingsworth © Copyright 2000, Jeffrey K. Hollingsworth, All Rights Reserved. University of Maryland.
Single Node Optimization Computational Astrophysics.
1 HPJAVA I.K.UJJWAL 07M11A1217 Dept. of Information Technology B.S.I.T.
Michael J. Voss and Rudolf Eigenmann PPoPP, ‘01 (Presented by Kanad Sinha)
Parallel Performance Wizard: A Generalized Performance Analysis Tool Hung-Hsun Su, Max Billingsley III, Seth Koehler, John Curreri, Alan D. George PPW.
DGrid: A Library of Large-Scale Distributed Spatial Data Structures Pieter Hooimeijer,
Parallel Algorithm Design
Many-core Software Development Platforms
Rohan Yadav and Charles Yuan (rohany) (chenhuiy)
Presentation transcript:

Active Harmony and the Chapel HPC Language Ray Chen, UMD Jeff Hollingsworth, UMD Michael P. Ferguson, LTS

Harmony Overview Harmony system based on feedback loop 2 Harmony Server Application Parameter Values Measured Performance

Simplex Algorithms Nelder-MeadParallel Rank Ordering 3

Tuning Granularity Initial Parameter Tuning o Application treated as a black box o Test parameters delivered during application launch o Application executes once per test configuration Internal Application Tuning o Specific internal functions or loops tuned o Possibly multiple locations within application o Multiple executions required to test configurations Run-time Tuning o Application modified to communicate with server mid-run o Only one run of the application needed 4

Example Application SMG2000 o 6-dimensional space o 3 tiling factors o 2 unrolling factors o 1 compiler choice o 20 search steps Performance gain o 2.37x for residual computation o 1.27x for on full application 5

The Irony of Auto-Tuning Intensely manual process o High cost of adoption Requires application specific knowledge o Tunable variable identification o Value range determination o Hotspot identification o Critical section modification at safe points Can auto-tuning be more automatic? 6

Towards Automatic Auto-tuning Reducing the burden on the end-user Three questions must be answered o What parameters are candidates for auto-tuning? o Where are the best code regions for auto-tuning? o When should we apply auto-tuning? 7

Our Goals Maximize return from minimal investment o Use profiling feature as a model o Should be enabled with a runtime flag o Aim to provide auto-tuning benefits within one execution Minimize language extension o Applications should be used as originally written Non-trivial goals with C/C++/Fortran o Are there any alternatives? 8

Chapel Overview Parallel programming language o Led by Cray Inc. o Chapel strives to vastly improve the programmability of large- scale parallel computers while matching or beating the performance and portability of current programming models like MPI. 9 Type of HW ParallelismProgramming ModelUnit of Parallelism Inter-nodeMPIexecutable Intra-node/multi-coreOpenMP/pthreadsiteration/task Instruction-level vectors/threadspragmasiteration GPU/acceleratorCUDA/OpenCL/OpenAccSIMD function/task Content courtesy of Cray Inc.

Chapel Methodology 10Content courtesy of Cray Inc.

Chapel Data Parallelism Only domains and forall loop requried o Forall loop used with arrays to distribute work o Domains used to control distribution o A generalization of ZPLs region concept 11Content courtesy of Cray Inc.

Chapel Task Parallelism Three constructs used to express control- based parallelism o begin – fire and forget o cobegin – heterogeneous tasks o coforall – homogeneous tasks 12 begin writeln(hello world); writeln(good bye); cobegin { consumer(1); consumer(2); producer(); } // wait here for all three tasks to complete begin producer(); coforall 1 in 1..numConsumers { consumer(i); } // wait here for all consumers to return Content courtesy of Cray Inc.

Chapel Locales MPI (SPMD) Functionality 13 writeln(start on locale 0); onLocales(1) do writeln(now on locale 1); writeln(on locale 0 again); proc main() { coforall loc in Locales do on loc do MySPMDProgram(loc.id, Locales.numElements); } proc MySPMDProgram(me, p) { println(Hello from node, me); } Content courtesy of Cray Inc.

Chapel Config Variables 14 config const numLocales: int; const LocaleSpace: domain(1) = [0..numLocales-1]; const Locales: [LocaleSpace] locale; % a.out --numLocales=4 Hello from node 3 Hello from node 0 Hello from node 1 Hello from node 2 Content courtesy of Cray Inc.

Leveraging Chapel Helpful design goals o Expressing parallelism and locality is the users responsibility o Not the compilers Chapel source effectively pre-annotated o Config variables help to locate candidate tuning parameters o Parallel looping constructs help to locate hotspots 15

Current Progress Harmony Client API ported to Chapel o Uses Chapels foreign function interface o Chapel client module to be added to next Harmony release Achieves the current state of auto-tuning o What to tune o Parameters must determined by a domain expert o Manually register each parameter and value range o Where to tune o Critical loop must be determined by a domain expert o Manually fetch and report performance at safe points o When to tune o Tuning enabled once manual changes are complete 16

Improving the What Leverage Chapels config variable type o Helpful for everybody to extend syntax slightly Not a silver bullet o False-positives and false-negatives definitely exist o Goes a long way towards reducing candidate variables o Chapel built-in candidate variables config const someArg = 5; 17 dataParTasksPerLocale dataParIgnoreRunningTasks dataParMinGranularity numLocales config const someArg = 5 in by 2;

Improving the Where Naïve approach o Modify all parallel loop constructs o Fetch new config values at loop head o Report performance at loop tail o Use PRO to efficiently search parameter space in parallel Poses open questions o How to know if config values are safe to modify mid-execution? o How to handle nested parallel loops? o How to prevent overhead explosion? Solutions outside the scope of this project o But weve got some ideas... 18

Whats Possible? Target pre-run optimization instead o Run small snippet of code pre-main o Determine optimal values to be used prior to execution Example: Cache optimization o Explore element size and stride o Pad array elements to fit size o Define domains o Automatically optimize for cache size and eviction strategy o Further increase performance portability Generate library of performance unit-tests o Bundle with Chapel for distribution 19

Improving the When Auto-tuning should be simple to enable o Use profiling as a model (just add –pg to the compiler flags) System should be self-reliant o Local server must be launched with application 20

Open Questions Automatic hotspot detection o Time spent in loop o Variables manipulated in loop o How to determine correctness-safe modification points o Static analysis? Moving to other languages o C/Fortran lacking needed annotations o More static analysis? Why avoid language extension? o Is it really so bad? 21