
Programming an SMP Desktop using Charm++
Laxmikant (Sanjay) Kale
Parallel Programming Laboratory
Department of Computer Science, University of Illinois at Urbana-Champaign
Supported in part by IACAT

Prologue
I will present an abbreviated version of the planned talk:
– We are running late.
– Also, I realized that what I really intended to present, with code examples, would need an hour-long talk.
– We will write that up in a report later (and perhaps put it in the Charm++ documentation).

Outline
– Charm++ is designed for portability between shared and distributed memory
– Optimizing the multicore (SMP) build of Charm++
   – The k-neighbor benchmark: description and performance
   – What optimizations were carried out
– Abstractions
   – Basic: shared object space, readonly data
   – Plain global variables: these still work; more on their disciplined use later
   – Nodegroups
   – Passing pointers to shared data structures, including sections of arrays
      – Readonly, write-exclusive: permissions by convention or "capability"

Optimizing the SMP implementation of Charm++
– Changed the memory allocator to avoid acquiring a lock on every memory allocation
– Reduced the granularity of critical sections
– Used thread-local storage (__thread) to avoid false sharing
– Used a memory fence instead of a lock for the producer-consumer queue (pcqueue); see the sketch below
– Reduced lock contention by using a separate message queue for every other core on the same node
– Simplified the pcqueue data structure
   – Assumes the queue size is adequately large
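
To make the fence-based pcqueue idea concrete, here is a minimal sketch of a single-producer, single-consumer ring buffer in which acquire/release ordering replaces a per-operation lock. This is an illustration in modern C++ atomics, not the actual Charm++ pcqueue code; the per-sender queues listed above are what make each queue single-producer, and the fixed capacity reflects the "adequately large queue size" assumption.

    #include <atomic>
    #include <cstddef>

    // Illustrative SPSC queue (hypothetical names, not Charm++'s pcqueue).
    template <typename T, std::size_t N>   // N assumed "adequately large"
    class SpScQueue {
      T buf[N];
      std::atomic<std::size_t> head{0};    // advanced only by the consumer
      std::atomic<std::size_t> tail{0};    // advanced only by the producer
    public:
      bool push(const T& item) {           // producer side: no lock taken
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
          return false;                    // full (assumed rare)
        buf[t % N] = item;
        // The release store publishes the filled slot to the consumer.
        tail.store(t + 1, std::memory_order_release);
        return true;
      }
      bool pop(T& item) {                  // consumer side: no lock taken
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
          return false;                    // empty
        item = buf[h % N];
        // The release store lets the producer safely reuse the slot.
        head.store(h + 1, std::memory_order_release);
        return true;
      }
    };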

Results on SMP Performance Improvement on K-Neighbor Test (8 cores, Mar’2009)

Results on SMP Performance Improvement on K-Neighbor Test (24 cores, Mar’2009)

Results on SMP Performance Improvement on K-Neighbor Test (16 cores, Apr’2009)

We evaluated many of our applications to test and demonstrate the efficacy of the optimized SMP runtime.

Jacobi 2D stencil computation on Power 5 (8000x8000 matrix size)

ChaNGa: Barnes-Hut-based production astronomy code

ChaNGa: Barnes-Hut-based production astronomy code (continued)

NAMD Scaling with Optimization: the ApoA1 benchmark running on the UPCRC machine

Summary of constructs that use shared memory in Charm++

Basic Mechanisms
– Chares and chare arrays constitute a "shared object space"
   – Analogous to a shared address space
– Readonly globals (see the sketch below)
   – Initialized in main::main, or in any method called synchronously from it
– Shared global variables
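
A minimal sketch of the readonly mechanism (module and variable names are hypothetical): the value is assigned in the mainchare's constructor, and the runtime then broadcasts it to all processes, after which every chare can read it like an ordinary global.

    // In the .ci interface file:
    mainmodule example {
      readonly int numWorkers;        // broadcast to all processes at startup
      mainchare Main {
        entry Main(CkArgMsg* m);
      };
      array [1D] Worker {
        entry Worker();
      };
    };

    // In the .C file:
    #include "example.decl.h"

    int numWorkers;                   // C++ definition of the readonly

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        numWorkers = 64;              // writable only during startup,
                                      // before the runtime broadcasts it
        CProxy_Worker::ckNew(numWorkers);
        delete m;
      }
    };

    #include "example.def.h"          // generated registration code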

More Powerful Mechanisms
– Node groups
– Passing pointers to shared data structures, including sections of arrays
   – Readonly or write permission

Node Groups
– A node group is a collection of objects (chares), with exactly one representative on each node
– Ideally suited for system libraries on SMP machines
– Similar to chare arrays: broadcasts, reductions, indexing
– But not completely like arrays: members are non-migratable, one per node
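
A sketch of a node group serving as a per-node cache (names hypothetical). Because entry methods of a node group may run concurrently on different cores of the same node (unless marked [exclusive]), any state shared inside it must be protected:

    // In the .ci file:
    nodegroup NodeCache {
      entry NodeCache();
      entry void deposit(int key, double value);
    };

    // In the .C file:
    #include <map>

    class NodeCache : public CBase_NodeCache {
      std::map<int, double> table;   // shared by all cores on this node
      CmiNodeLock lock;              // Converse node-level lock
     public:
      NodeCache() : lock(CmiCreateLock()) {}
      void deposit(int key, double value) {
        CmiLock(lock);               // deposit() may be invoked concurrently
        table[key] = value;          // from several cores of the node
        CmiUnlock(lock);
      }
    };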

Conditional Packing
– Passing a data structure between chares:
   – Pass the pointer if the destination is within the node
   – PUP the entire structure if the destination is outside the node
– Who owns the data and frees it?
   – The data structure must inherit from CkConditional, which is reference counted
– A data structure can contain information about an array section
   – Useful in cases like in-place sorting (e.g. quicksort)
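
A hedged sketch of a conditionally packed structure (class and field names are hypothetical; see the Charm++ manual for the exact entry-method syntax). Deriving from CkConditional lets the runtime reference-count the object and invoke its pup routine only when the message actually leaves the node:

    class Chunk : public CkConditional { // reference-counted by the runtime
     public:
      int n;
      double* data;                      // may point into a larger array,
                                         // e.g. a section being sorted
      // Invoked only when the destination is on another node; within the
      // node the receiver simply gets the pointer.
      void pup(PUP::er& p) {
        p | n;
        if (p.isUnpacking()) data = new double[n];
        PUParray(p, data, n);
      }
    };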

Sharing Data and Conditional Packing
– Pointers can be sent in "messages"; the underlying data structures are packed only when the message crosses node boundaries
   – (a feature of the Chare Kernel since 1989 or so!)
– A data structure being shared should be encapsulated, with a read or write "capability"
   – If I give you write access, I promise not to modify it, read it, or grant access to someone else
   – If I give you read access, I promise not to change it until you are done
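
The capability is a promise between sender and receiver, not something the runtime enforces, but the convention can be made harder to break with thin wrapper types. A sketch (not a Charm++ API; all names are invented):

    // The sender constructs one of these around the shared structure and,
    // by convention, stops touching the object until it is handed back.
    template <typename T>
    class ReadCap {                  // shared, read-only view
      const T* p;
     public:
      explicit ReadCap(const T* p) : p(p) {}
      const T& get() const { return *p; }  // no mutation possible
    };

    template <typename T>
    class WriteCap {                 // exclusive, writable view
      T* p;
     public:
      explicit WriteCap(T* p) : p(p) {}
      T& get() { return *p; }
    };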

Disciplined Sharing
– My pet idea: shared arrays with restricted modes
   – Readonly, write-exclusive, accumulate, and "owner-computes"
   – Modes can change only at well-defined global synchronization points
   – Captures a large fraction of the uses of shared arrays
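
A sketch of what such a restricted-mode array might look like (entirely hypothetical; this is the proposal above, not an existing Charm++ construct). The mode is switched collectively at a sync point, and each access asserts that the current mode permits it:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    enum class Mode { ReadOnly, WriteExclusive, Accumulate, OwnerComputes };

    template <typename T>
    class ModalArray {
      std::vector<T> data;
      Mode mode = Mode::ReadOnly;
     public:
      explicit ModalArray(std::size_t n) : data(n) {}
      // Assumed to be called collectively, only at a global sync point.
      void setMode(Mode m) { mode = m; }
      T read(std::size_t i) const {
        assert(mode == Mode::ReadOnly);
        return data[i];
      }
      void write(std::size_t i, T v) {
        assert(mode == Mode::WriteExclusive); // owner-computes would also
        data[i] = v;                          // check index ownership
      }
      void accumulate(std::size_t i, T v) {
        assert(mode == Mode::Accumulate);
        data[i] += v;
      }
    };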