STAPL: The C++ Standard Template Adaptive Parallel Library

Alin Jula, Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger
Department of Computer Science, Texas A&M University
Motivation
– Building-block library: nested parallelism
– Inter-operability with existing code: superset of STL
– Portability and performance: layered architecture, run-time adaptivity
Philosophy
– Interface Layer: STL compatible
– Concurrency & Communication Layer: generic parallelism, synchronization
– Software Implementation Layer: instantiates concurrency & communication
– Machine Layer: architecture-dependent code
Related Work

STAPL is compared with AVTL, CHARM++, CHAOS++, CILK*, NESL*, POOMA, PSTL and SPLIT-C* (* = parallel programming language) along these axes:
– Paradigm: STAPL supports both SPMD and MIMD; most of the others are SPMD-only or MIMD-only
– Architecture: STAPL targets both shared and distributed memory
– Nested parallelism: supported by STAPL, CILK and NESL; not by the others
– Adaptivity: STAPL is the only adaptive library in the comparison
– Genericity and irregular data: supported by STAPL; only partially or not at all by the others
– Data decomposition and mapping: automatic and/or user-specified in STAPL
– Scheduling: STAPL offers user-defined, static, dynamic and block scheduling; the others rely on MPI-based scheduling, prioritized execution, decomposition-based scheduling, work stealing, the work-and-depth model, pthread scheduling, the Tulip RTS, or user scheduling
– Communication/computation overlap: supported by STAPL; only by some of the others
STL Overview
– Containers: data is stored in Containers
– Algorithms: STL provides standardized Algorithms
– Iterators: bind Algorithms to Containers; they are generalized pointers

Example (Container, Iterator, Algorithm):
    vector<int> vect;
    ...                               // initialization of 'vect'
    sort(vect.begin(), vect.end());
STAPL Overview
– pContainers: data is stored in pContainers
– pAlgorithms: STAPL provides standardized pAlgorithms
– pRanges: bind pAlgorithms to pContainers
  – Similar to STL Iterators, but must also support parallelism
pRange
The pRange is the parallel counterpart of the STL iterator:
– Binds pAlgorithms to pContainers
– Provides an abstract view of a scoped data space
– The data space is (recursively) partitioned into subranges
It is more than an iterator, since it supports parallelization:
– A scheduler/distributor decides how computation and data structures should be mapped to the machine
– Data dependences among subranges can be represented by a data dependence graph (DDG)
– An executor launches the parallel computation, manages communication, and enforces dependences
pRange
– Provides random access to a partition of the data space
  – The view and access are provided by a collection of iterators describing the pRange boundary
– pRanges are partitioned into subranges
  – Automatically by STAPL, based on machine characteristics, number of processors, partition factors, etc.
  – Manually, according to user-specified partitions
– A pRange can represent relationships among subspaces as a data dependence graph (DDG), used for scheduling
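The recursive partitioning of a data space into disjoint subranges can be sketched in plain C++. This is a toy illustration only: `Range` and `partition` are hypothetical names, not STAPL's actual interface.

```cpp
#include <cstddef>
#include <vector>

// Toy sketch of a range over a data space that can be split into
// disjoint, contiguous subranges, as a STAPL pRange is. Because each
// subrange is itself a Range, partitioning applies recursively
// (nested parallelism). Names are illustrative, not STAPL's API.
template <class Iter>
struct Range {
    Iter first, last;

    std::size_t size() const { return static_cast<std::size_t>(last - first); }

    // Split into n disjoint subranges whose sizes differ by at most one.
    std::vector<Range> partition(std::size_t n) const {
        std::vector<Range> subs;
        const std::size_t total = size(), base = total / n, rem = total % n;
        Iter cur = first;
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t len = base + (i < rem ? 1 : 0);
            subs.push_back(Range{cur, cur + len});
            cur += len;
        }
        return subs;
    }
};
```

A subrange returned by `partition` can itself be partitioned, which is exactly what makes nested parallelism possible.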
pRange
– Each subspace is disjoint and can itself be a pRange (nested parallelism)

    stapl::pRange<Container::iterator> dataRange(segBegin, segEnd);
    dataRange.partition();
    stapl::pRange<Container::iterator> dataSubrange = dataRange.get_subrange(3);
    dataSubrange.partition_like( ... );
pContainer
The pContainer is the parallel counterpart of the STL container:
– Provides parallel and concurrent methods
– Maintains an internal pRange
  – Updated during insert/delete operations
  – Minimizes redistribution
– Completed: pVector, pList, pTree
Example: a pVector combines a pRange with underlying STL vector storage.
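The point that a pContainer keeps its internal pRange consistent across insertions can be illustrated with a toy wrapper (hypothetical names, not STAPL code): appending an element only touches the bookkeeping of the last chunk, so the partition of earlier elements never needs to be rebuilt.

```cpp
#include <cstddef>
#include <vector>

// Toy illustration (not STAPL's API): a container that maintains an
// internal partition of its elements, updating it incrementally on
// insert so the data need not be redistributed on every operation.
class ToyPVector {
    std::vector<int> data_;
    std::vector<std::size_t> chunk_size_;  // one entry per "subrange"
public:
    explicit ToyPVector(std::size_t nchunks) : chunk_size_(nchunks, 0) {}

    // Append at the end: only the last chunk's size changes, so the
    // existing partition of earlier elements is left untouched.
    void push_back(int x) {
        data_.push_back(x);
        ++chunk_size_.back();
    }

    std::size_t size() const { return data_.size(); }
    std::size_t chunk_size(std::size_t i) const { return chunk_size_[i]; }
};
```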
pAlgorithm
The pAlgorithm is the parallel counterpart of the STL algorithm.
Parallel algorithms take as input:
– a pRange
– work functions that operate on subranges; the work function is applied to all subranges

    template <class SubRange>
    class pAddOne : public stapl::pFunction {
    public:
      ...
      void operator()(SubRange& spr) {
        typename SubRange::iterator i;
        for (i = spr.begin(); i != spr.end(); i++)
          (*i)++;
      }
    };
    ...
    p_transform(pRange, pAddOne());
Run-Time System
– Support for different architectures: HP V2200; SGI Origin 2000, SGI Power Challenge
– Support for different paradigms: OpenMP, Pthreads; MPI
– Memory allocation: HOARD
[Figure: a pAlgorithm mapped by the run-time onto clusters 1-4; cluster 4 holds processors 12-15]
Run-Time System
– Scheduler
  – Determines an execution order (DDG)
  – Policies: automatic (static, block, dynamic, partial self-scheduling, complete self-scheduling) or user-defined
– Distributor
  – Hierarchical data distribution
  – Automatic and user-defined
– Executor
  – Executes the DDG: processor assignment, synchronization and communication
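A minimal sketch of an executor enforcing a DDG's ordering, in plain C++ (hypothetical; STAPL's executor additionally handles processor assignment and communication): a task is released only once all of its predecessors have run.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Toy executor sketch: run tasks 0..n-1 in an order consistent with a
// data-dependence graph, releasing a task when all its predecessors are
// done (Kahn's algorithm). deps[i] lists the tasks that must precede i.
std::vector<std::size_t>
execute_ddg(const std::vector<std::vector<std::size_t>>& deps) {
    const std::size_t n = deps.size();
    std::vector<std::size_t> indegree(n, 0);
    std::vector<std::vector<std::size_t>> succ(n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t p : deps[i]) {
            succ[p].push_back(i);
            ++indegree[i];
        }
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (indegree[i] == 0) ready.push(i);
    std::vector<std::size_t> order;  // stands in for "run the task"
    while (!ready.empty()) {
        std::size_t t = ready.front(); ready.pop();
        order.push_back(t);
        for (std::size_t s : succ[t])
            if (--indegree[s] == 0) ready.push(s);
    }
    return order;
}
```

Tasks with no remaining dependences form the ready set; a real executor would hand that set to idle processors instead of running it serially.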
STL to STAPL Automatic Translation
– A C++ preprocessor converts STL code into STAPL parallel code
– Iterators are used to construct pRanges
– The user is responsible for safe parallelization

Original STL code:
    accumulate(x.begin(), x.end(), 0);
    for_each(x.begin(), x.end(), foo());

After the preprocessing phase:
    pi_accumulate(x.begin(), x.end(), 0);
    pi_for_each(x.begin(), x.end(), foo());

After pRange construction:
    p_accumulate(x_pRange, 0);
    p_for_each(x_pRange, foo());

In some cases automatic translation achieves performance similar to hand-written STAPL code (about 5% deterioration).
Performance: p_find
[Figure: experimental results on HP V2200]
Performance: p_inner_product
[Figure: experimental results on HP V2200]
pTree
– The parallel tree supports bulk commutative operations in parallel
– Each processor is assigned a set of subtrees to maintain
– Operations on the base are atomic; operations on subtrees are parallel
[Figure: base (atomic) and subtrees (parallel), owned by P1, P2, P3]

Example: parallel insertion algorithm. Each processor is given a set of elements:
1) Each processor creates local buckets corresponding to the subtrees
2) Each processor collects the buckets that correspond to its subtrees
3) Elements in the subtree buckets are inserted into the tree in parallel
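The bulk-insertion idea can be emulated with standard containers and threads (an illustration of the scheme, not STAPL's pTree): subtrees are modeled as `std::set`s owning disjoint key ranges, so in step 3 each thread inserts into its own subtree with no locking.

```cpp
#include <set>
#include <thread>
#include <vector>

// Illustration of the pTree bulk-insert scheme with standard containers.
// Each "processor" owns one subtree; keys are bucketed by value range,
// buckets are exchanged, then each owner inserts its bucket in parallel.
// Because owners touch disjoint sets, step 3 needs no locking.
std::vector<std::set<int>> bulk_insert(const std::vector<int>& elems,
                                       int nsubtrees, int range_per_subtree) {
    // Steps 1+2: build one bucket per subtree (done centrally here).
    std::vector<std::vector<int>> buckets(nsubtrees);
    for (int x : elems) {
        int b = x / range_per_subtree;
        if (b >= nsubtrees) b = nsubtrees - 1;
        buckets[b].push_back(x);
    }
    // Step 3: each owner inserts its own bucket, in parallel.
    std::vector<std::set<int>> subtrees(nsubtrees);
    std::vector<std::thread> workers;
    for (int p = 0; p < nsubtrees; ++p)
        workers.emplace_back([&, p] {
            for (int x : buckets[p]) subtrees[p].insert(x);
        });
    for (auto& w : workers) w.join();
    return subtrees;
}
```

The commutativity requirement shows up here too: the result is the same regardless of the order in which threads finish.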
pTree
– Basis for the STAPL pSet, pMultiSet, pMap and pMultiMap containers
  – Covers all remaining STL containers
– Results are sequentially consistent, although the internal structure may vary
– Requires negligible additional memory
– pTrees can be used either sequentially or in parallel in the same execution
  – Allows switching back and forth between parallel and sequential use
Performance: pTree
[Figure: experimental results on HP V2200]
Algorithm Adaptivity
Problem: parallel algorithms are highly sensitive to
– Architecture: number of processors, memory interconnection, cache, available resources, etc.
– Environment: thread management, memory allocation, operating system policies, etc.
– Data characteristics: input type, layout, etc.
Solution: implement a number of different algorithms and adaptively choose the best one at run time.
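The run-time selection step can be sketched as a small dispatch table (names and predicates here are illustrative, not STAPL's adaptive framework): each registered variant carries an applicability test over the run-time characteristics, and the framework dispatches to the first variant whose test holds.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy sketch of run-time algorithm selection (illustrative names).
// Each variant registers a predicate over run-time characteristics;
// the framework dispatches to the first variant whose predicate holds.
struct Inputs {
    bool integer_keys;  // data characteristics
    int  num_procs;     // machine characteristics
};

struct Variant {
    std::string name;
    std::function<bool(const Inputs&)> applies;
};

std::string choose(const std::vector<Variant>& variants, const Inputs& in) {
    for (const auto& v : variants)
        if (v.applies(in)) return v.name;
    return "default";
}
```

With sorting variants registered in order of preference (radix for integer keys, merge for small processor counts, column otherwise), `choose` reproduces a decision-tree style of dispatch.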
Adaptive Framework
[Figure: the adaptive framework]
Case Study: Adaptive Sorting

Sort    | Strength                   | Weakness
--------|----------------------------|----------------------
Column  | Theoretically time-optimal | Many passes over data
Merge   | Low memory overhead        | Poor scalability
Radix   | Extremely fast             | Integers only
Sample  | Two passes over data       | High memory overhead
Performance: Adaptive Sorting
[Figure: performance on 10 million integers on the HP V2200, SGI Power Challenge and SGI Origin 2000]
Performance: Run-Time Tests

    if (data_type == INTEGER)
      radix_sort();
    else if (num_procs < 5)
      merge_sort();
    else
      column_sort();

[Figure: experimental results on SGI Origin 2000]
Performance: Molecular Dynamics*
Discrete-time particle interaction simulation:
– Written in STL
– Time steps calculate system evolution (dependence)
– Parallelized within each time step
STAPL utilization:
– pAlgorithms: p_for_each, p_transform, p_accumulate
– pContainers: pVector (push_back)
– Automatic vs. manual parallelization (5% performance deterioration)

* Code written by Danny Rintoul at Sandia National Labs
Performance: Molecular Dynamics
[Figure: execution time (sec) vs. number of processors on HP V2200, for 108K and 23K particles]
– The code is partially parallelized (up to 49%)
– Input sensitive
– Use pTree on the rest
Performance: Particle Transport*
Generic particle transport solver:
– Regular and arbitrary grids
– Numerically intensive; 25K lines of C++ STAPL code
– The sweep function is unaware of parallel issues
STAPL utilization:
– pAlgorithms: p_for_each
– pContainers: pVector (for data distribution)
– Scheduler: determines grid data dependencies
– Executor: satisfies data dependencies

* Joint effort between Texas A&M Nuclear Engineering and Computer Science, funded by DOE ASCI
Performance: Particle Transport
Profile and speedups on an SGI Origin 2000 using 16 processors:

Code Region                     | % Seq. | Speedup
--------------------------------|--------|--------
Create computational grid       |        |
Scattering across group-sets    | 0.05   | N/A
Scattering within a group-set   |        |
Sweep                           |        |
Convergence across group-sets   | 0.05   | N/A
Convergence within group-sets   | 0.05   | N/A
Other                           | 2.59   | N/A
Total                           |        |
Performance: Particle Transport
[Figure: experimental results on SGI Origin 2000]
Summary
– Parallel equivalent of STL
  – Many codes can immediately utilize STAPL
  – Automatic translation
– Building-block library
  – Portability (layered architecture)
  – Performance (adaptivity)
  – Automatic recursive parallelism
– STAPL performs well both in small pAlgorithm test cases and in large codes
STAPL Status and Current Work
– pAlgorithms: fully implemented
– pContainers: pVector, pList, pTree
– pRange: mostly implemented
– Run-Time
  – Executor: fully implemented
  – Scheduler: fully implemented
  – Distributor: work in progress
– Adaptive mechanism (case study: sorting)
– OpenMP + MPI (mixed): work in progress
  – OpenMP version: fully implemented
  – MPI version: work in progress
Project funded by NSF and DOE.