A Behavioral Memory Model for the UPC Language Kathy Yelick University of California, Berkeley and Lawrence Berkeley National Laboratory.



Dagstuhl: Consistency Models, Oct 17, 2003

UPC Collaborators
This talk presents joint work with:
- Chuck Wallace, MTU
- Dan Bonachea, UCB
- Jason Duell, LBNL
With input from the UPC community, in particular:
- Bill Carlson, IDA
- Brian Wibecan, HP
The Berkeley UPC Group: Christian Bell, Dan Bonachea, Wei Yu Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome

Global Address Space Languages
- Explicitly parallel programming model with SPMD parallelism
  - Fixed at program start-up, typically one thread per processor
- Global address space model of memory
  - Allows the programmer to directly represent distributed data structures
  - Address space is logically partitioned: local vs. remote memory (two-level hierarchy)
- Programmer control over performance-critical decisions
  - Data layout and communication
- Performance transparency and tunability are goals
  - Initial implementations can use fine-grained shared memory
- Suitable for current and future architectures
  - Either shared memory or lightweight messaging is key
- Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)

Why Another Language?
- MPI is the current standard for programming large-scale machines, but difficulty of use has left users behind
  - Clusters of SMPs lead to two parallel programs in one
- A single model for shared and distributed memory machines:
  - Shared memory multiprocessors (SMPs, SGI Origin, etc.)
  - Global address space machines (Cray T3D/E, X1): remote put/get instructions, but no hardware caching of remote data
  - Distributed memory machines/clusters with fast communication: shmem, GASNet (LAPI, GM, Elan, SCI), Active Messages; software caching in some implementations
- UPC is popular within some government labs
- Commercial and open source compilers exist

Global Address Space
- Several kinds of array distributions:
  - double a[n]             a private n-element array on each thread
  - shared double b[n]      an n-element shared array, with cyclic element mapping
  - shared [4] double c[n]  a block-cyclic shared array with 4-element blocks
- Pointers for irregular data structures:
  - shared double *sp       a pointer to shared data
  - double *lp              a pointer to local data (assumed private)
[diagram: the partitioned global address space, with the a arrays and lp in each thread's private section, and sp pointing into the shared section]

UPC Memory Model
- UPC has two types of memory accesses:
  - Relaxed: the operation must respect local (on-thread) dependencies; other threads may observe these operations happening in different orders
  - Strict: the operation must appear atomic; all relaxed operations issued earlier must appear to complete before it, and all relaxed operations issued later must appear to happen after it
- Several ways to specify the access type:
  - strict shared int x;   (type qualifier)
  - #pragma upc_relaxed    (pragma)
  - an include file

Behavioral Approach
- Problems with operational specifications:
  - Implicit assumptions about implementation strategy (e.g., caches)
  - May unnecessarily restrict implementations
  - Intuitive in principle, but complicated in practice
- A behavioral approach:
  - Based on partial and total orders
  - Uses the definition of sequential consistency as a model: processor order defines a total order on each thread; the union of these defines a partial order; there must exist a consistent total order that is correct as a serial execution
[diagram: operations of two threads P0 and P1 merged into a single total order]

Some Basic Notation
- The set of all operations is O; Ot = the set of operations issued by thread t
- The set of memory operations is M = {m0, m1, ...}; Mt = the set of memory operations from thread t
- Each memory operation has properties:
  - Thread(mi) is the thread that executed the operation
  - Location(mi) is the memory location involved
- Memory operations are partitioned into 6 sets, named by two letters: S = Strict, R = Relaxed, or P = Private in the first position; W = Write or R = Read in the second
- Some useful groups:
  - Strict(M) = SW(M) ∪ SR(M)
  - W(M) = SW(M) ∪ RW(M) ∪ PW(M)

Compiler Assumption
- For specification purposes, assume the code is compiled by a naive compiler to the ISO C abstract machine
- Real compilers may perform optimizations, e.g., reorder, remove, or insert memory operations
  - Even strict operations may be reordered given sufficient analysis (cycle detection)
- These optimizations must produce an execution whose input/output and volatile behavior is identical to that of the unoptimized program (as in ISO C)

Orderings on Strict Operations
- All threads must agree on an ordering of strict operations
- For pairs of strict accesses, the agreed ordering is total
- For a strict/relaxed pair on the same thread, all threads see the program order

Orderings on Local Operations
- Conflicting accesses have the usual definition (same location, at least one is a write)
- Given a serial execution S = [o1, ..., on] defining <S, let St be the subsequence of operations issued by thread t
- S conforms to program order for thread t iff St is consistent with the program text for t (follows control flow)
- S conforms to program dependence order for t iff there exists a permutation P(S) such that:
  - P(S) conforms to program order for t
  - for all (m1, m2) ∈ Conflicting(M): m1 <S m2 iff m1 <P(S) m2

UPC Consistency
An execution on T threads with memory operations M is UPC consistent iff:
- There exists a partial order <strict that orients all pairs in allStrict(M)
  (all threads agree on the ordering of strict operations)
- And for each thread t, there exists a total order <t on Ot ∪ W(M) ∪ SR(M) such that:
  - <t is consistent with <strict
  - <t conforms to program dependence order (local dependencies are observed)
  - <t is a correct serial execution (reads return the most recent write values)

Intuition on Strict Orderings
- Each thread may "build" its own total order to explain the behavior it observes
- All threads agree on the strict ordering, but different threads may see relaxed writes in different orders
  - Allows non-blocking writes to be used in implementations
- Each thread sees its own dependencies, but not those of other threads
  - Weak, but otherwise there would be consistency requirements on some relaxed operations
  - Preserving dependencies requires the usual compiler/hardware analysis
[diagram: two threads P0 and P1, with the agreed strict ordering shown in black]

Synchronization Operations
- UPC has both global and pairwise synchronization
- In addition to their synchronization properties, these operations have memory model implications:
  - Locks: upc_lock is a strict read; upc_unlock is a strict write
  - Barriers (which may be split-phase): upc_notify (begin barrier) is a strict write; upc_wait (end of barrier) is a strict read; upc_barrier = upc_notify; upc_wait
- (More technical details in the definitions as to the variable being read/written)

Properties of UPC Consistency
- A program containing only strict operations is sequentially consistent
- A program that produces only race-free executions is sequentially consistent
- A UPC consistent execution of a program is race-free if for all threads t and all enabling orderings <t, for all potential races: if m1 <t m2 then there exist synchronization operations o1, o2 such that
  - m1 <t o1 <t o2 <t m2, and
  - Thread(o1) = Thread(m1) and Thread(o2) = Thread(m2), and
  - either o1 is upc_notify and o2 is upc_wait, or o1 is upc_unlock and o2 is upc_lock on the same lock variable

Alternative Models
- As specified, two relaxed writes to the same location may be viewed differently by different processors
  - Nothing forces eventual consistency (though it is likely in implementations)
  - May add this requirement at barrier points, at least
- So far the model looks ad hoc
  - Adding directionality to reads/writes seems reasonable: strict reads "fence" operations that follow; strict writes "fence" operations that precede
  - A simple replacement for the StrictOnThreads definition
  - Supports user-defined synchronization primitives built from strict operations

Future Plans
- Show that various implementations satisfy this specification:
  - Use of non-blocking writes for relaxed writes, with a write fence/sync at strict points
  - Compiler-inserted prefetching of relaxed reads
  - Compiler-inserted "message vectorization" to aggregate a set of small operations into one larger one
  - A software caching implementation with cache flushes at strict points
- Develop an operational model and show equivalence (or at least that it implements the spec)
- Define the data unit of atomicity: the fundamental unit of interleaving, data tearing, conflicts

Conclusions
- Behavioral specifications:
  - Are relatively concise
  - Are not intended for most end users, who would see only the "properties" part
  - Avoid reference to implementation-specific notions, and are likely to constrain implementations less than operational specs
- UPC:
  - Has a user-controlled consistency model specified at the language level
  - The language model need not match that of the underlying machine: it may be stronger (by inserting fences) or weaker (by reordering operations at compile time)
  - Seems to be acceptable within the high-end programming community (there is also evidence of this in the MPI-2 spec)

Backup Slides

Communication Support Today
- Potential performance advantage for fine-grained, one-sided programs
- Potential productivity advantage for irregular applications

Hardware Limitations to Software Innovation
[chart: software send overhead for 8-byte messages over time]
- Not improving much over time (even in absolute terms)

Example: Berkeley UPC Compiler
- Compiler based on Open64
  - Multiple front ends, including gcc
  - Intermediate form called WHIRL
- Pipeline: UPC source, to Higher WHIRL (optimizing transformations), to Lower WHIRL, then to C + Runtime, or to Assembly (IA64, MIPS, ...) + Runtime
  - Current focus on the C backend; IA64 possible in the future
- UPC Runtime
  - Pointer representation
  - Shared/distributed memory
- Communication in GASNet
  - Portable, language-independent

Research Opportunities
- Compiler analysis and optimizations
  - Recognize local accesses and avoid runtime checks/storage
  - Communication and memory optimizations: separate get/put initiation from synchronization (prefetch); message aggregation (fine-grained to bulk), tiling, and caching
- Language design
  - Dynamic parallelism for load balance
  - Multiscale parallelism: express parallelism at all levels
  - Linguistic support for unstructured and sparse data structures
  - Annotations, types, and pragmas for correctness and performance
- Higher-level languages
  - Parallel Matlab or parallelizing Matlab compilers
  - Domain-specific parallelism