A Behavioral Memory Model for the UPC Language
Kathy Yelick, University of California, Berkeley and Lawrence Berkeley National Laboratory
UPC Collaborators

This talk presents joint work with:
- Chuck Wallace, MTU
- Dan Bonachea, UCB
- Jason Duell, LBNL

With input from the UPC community, in particular:
- Bill Carlson, IDA
- Brian Wibecan, HP

The Berkeley UPC group: Christian Bell, Dan Bonachea, Wei Yu Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome
Global Address Space Languages

- Explicitly parallel programming model with SPMD parallelism
  - Fixed at program start-up, typically one thread per processor
- Global address space model of memory
  - Allows the programmer to directly represent distributed data structures
  - Address space is logically partitioned: local vs. remote memory (two-level hierarchy)
- Programmer control over performance-critical decisions
  - Data layout and communication
- Performance transparency and tunability are goals
  - An initial implementation can use fine-grained shared memory
- Suitable for current and future architectures
  - Either shared memory or lightweight messaging is key
- Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)
Why Another Language?

- MPI is the current standard for programming large-scale machines, but its difficulty of use has left users behind
  - Clusters of SMPs lead to two parallel programming models in one program
- A single model is wanted for shared and distributed memory machines:
  - Shared memory multiprocessors (SMPs, SGI Origin, etc.)
  - Global address space machines (Cray T3D/E, X1): remote put/get instructions, but no hardware caching of remote data
  - Distributed memory machines/clusters with fast communication: shmem, GASNet (LAPI, GM, Elan, SCI), Active Messages; software caching in some implementations
- UPC is popular within some government labs
- Commercial and open source compilers exist
Global Address Space

Several kinds of array distributions:
- double a[n]: a private n-element array on each processor
- shared double b[n]: an n-element shared array, with cyclic mapping
- shared [4] double c[n]: a block-cyclic array with 4-element blocks

Pointers for irregular data structures:
- shared double *sp: a pointer to shared data
- double *lp: a pointer to local data (assumed private)

[Figure: the global address space divided into shared and private regions, with sp pointing into the shared region and lp into thread-private memory]
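A minimal UPC sketch of these declarations (illustrative only: the size N, the loop, and the printout are my additions, and a static-THREADS compilation environment is assumed so the fixed-size shared arrays are legal):

    #include <upc.h>
    #include <stdio.h>

    #define N 16

    double a[N];            /* private: each thread has its own copy                  */
    shared double b[N];     /* shared, cyclic: b[i] has affinity to thread i % THREADS */
    shared [4] double c[N]; /* shared, block-cyclic: 4-element blocks dealt round-robin */

    int main(void) {
        shared double *sp = &b[0]; /* pointer-to-shared: may reference remote data */
        double *lp = &a[0];        /* ordinary C pointer: private data only        */
        int i;
        (void)sp; (void)lp;

        /* The affinity expression &c[i] runs each iteration on the thread
           that owns c[i], so every write below is a local access.         */
        upc_forall (i = 0; i < N; i++; &c[i])
            c[i] = MYTHREAD;

        upc_barrier;
        if (MYTHREAD == 0)
            printf("c[5] has affinity to thread %d\n", (int)upc_threadof(&c[5]));
        return 0;
    }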
UPC Memory Model

UPC has two types of memory accesses:
- Relaxed: the operation must respect local (on-thread) dependencies; other threads may observe these operations happening in different orders
- Strict: the operation must appear atomic; all relaxed operations issued earlier must complete before it, and all relaxed operations issued later must happen later

Several ways to specify the access:
- Type qualifier: strict shared int x;
- Pragma: #pragma upc strict / #pragma upc relaxed
- Include file: #include <upc_strict.h> or #include <upc_relaxed.h>
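The canonical use of the two access types, as a sketch (the producer/consumer split and the variable names are mine, not the slide's): a relaxed data write is published through a strict flag write, so a reader that observes the flag set must also observe the data.

    #include <upc_relaxed.h>  /* unqualified shared accesses default to relaxed */

    relaxed shared int data;  /* static shared data is zero-initialized */
    strict  shared int flag;

    void producer(void) {     /* run on one thread */
        data = 42;            /* relaxed write */
        flag = 1;             /* strict write: all earlier relaxed operations
                                 complete before any thread can see flag == 1 */
    }

    void consumer(void) {     /* run on another thread */
        while (!flag)         /* strict read: spin until the flag is set */
            ;
        int v = data;         /* relaxed read, but ordered after the strict
                                 read of flag, so it must return 42        */
        (void)v;
    }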
Behavioral Approach

Problems with operational specifications:
- Implicit assumptions about an implementation strategy (e.g., caches)
- May unnecessarily restrict implementations
- Intuitive in principle, but complicated in practice

A behavioral approach:
- Based on partial and total orders
- Uses the definition of sequential consistency as a model:
  - Processor order defines a total order on each thread
  - The union of these defines a partial order
  - ∃ a consistent total order that is correct as a serial execution
[Figure: operations of threads P0 and P1 merged into one total order]
Some Basic Notation

- The set of all operations is O; O_t = the set of operations issued by thread t
- The set of memory operations is M = {m_0, m_1, ...}; M_t = the set of memory operations from thread t
- Each memory operation has properties:
  - Thread(m_i) is the thread that executed the operation
  - Location(m_i) is the memory location involved
- Memory operations are partitioned into 6 sets, named by S = strict, R = relaxed, P = private in the first position, and W = write, R = read in the second position
- Some useful groups:
  - Strict(M) = SW(M) ∪ SR(M)
  - W(M) = SW(M) ∪ RW(M) ∪ PW(M)
Compiler Assumption

- For specification purposes, assume the code is translated by a naïve compiler into ISO C
- Real compilers may perform optimizations, e.g., reordering, removing, or inserting memory operations
  - Even strict operations may be reordered given sufficient analysis (cycle detection)
- These must produce an execution whose input/output and volatile behavior is identical to that of the unoptimized program (ISO C)
Orderings on Strict Operations

- Threads must agree on an ordering of strict operations
- For pairs of strict accesses, the agreed ordering is total
- For a strict/relaxed pair on the same thread, all threads see the program order
[The formal definitions on this slide appeared as figures]
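The formulas did not survive extraction. A plausible formal reading of the bullets, in the notation of the surrounding slides (my reconstruction, not the original text):

- Totality on strict pairs: ∀ s_1, s_2 ∈ Strict(M), either s_1 <_strict s_2 or s_2 <_strict s_1
- Strict/relaxed pairs on one thread: ∀ s ∈ Strict(M_t) and m ∈ M_t, if s precedes m in t's program order then s precedes m in every thread's explaining order <_t (and symmetrically if m precedes s)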
Orderings on Local Operations

- Conflicting accesses have the usual definition: two accesses to the same location, at least one of which is a write
- Given a serial execution S = [o_1, ..., o_n] defining <_S, let S_t be the subsequence of operations issued by thread t
- S conforms to program order for thread t iff S_t is consistent with the program text for t (follows control flow)
- S conforms to program dependence order for t iff ∃ a permutation P(S) such that:
  - P(S) conforms to program order for t
  - ∀ (m_1, m_2) ∈ Conflicting(M): m_1 <_S m_2 ⇒ m_1 <_P(S) m_2
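For intuition, a sketch of what program dependence order does and does not constrain (x and y are hypothetical relaxed shared variables):

    /* One thread issues three relaxed writes: */
    x = 1;   /* m1 */
    y = 2;   /* m2: does not conflict with m1 (different location), so a
                conforming permutation P(S) may place m2 before m1          */
    x = 3;   /* m3: conflicts with m1 (same location, both writes), so every
                conforming P(S) must keep m1 before m3                      */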
UPC Consistency

An execution on T threads with memory operations M is UPC consistent iff:
- ∃ a partial order <_strict that orients all pairs in allStrict(M)
- For each thread t, ∃ a total order <_t on O_t ∪ W(M) ∪ SR(M) such that:
  - <_t is consistent with <_strict (all threads agree on the ordering of strict operations)
  - <_t conforms to program dependence order (local dependencies are observed)
  - <_t is a correct serial execution (reads return the most recent write values)
Intuition on Strict Orderings

- Each thread may "build" its own total order to explain the behavior it observes
- All threads agree on the strict ordering (shown in black in the figure), but different threads may see relaxed writes in different orders (see the sketch below)
  - This allows non-blocking writes to be used in implementations
- Each thread sees its own dependencies, but not those of other threads
  - This is weak, but otherwise there would be consistency requirements on some relaxed operations
- Preserving dependencies requires the usual compiler/hardware analysis
[Figure: the strict operations of threads P0 and P1 in one agreed total order, with relaxed operations interleaved differently per thread]
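A sketch of the behavior this permits (x and y are hypothetical relaxed shared ints, initially 0; r1..r4 are private):

    if (MYTHREAD == 0) {
        x = 1;               /* two relaxed writes to different locations */
        y = 1;
    } else if (MYTHREAD == 1) {
        r1 = x;              /* may observe r1 == 1 ...                   */
        r2 = y;              /* ... and r2 == 0                           */
    } else if (MYTHREAD == 2) {
        r3 = y;              /* while, in the same execution, r3 == 1 ... */
        r4 = x;              /* ... and r4 == 0                           */
    }
    /* The two readers disagree on the order of the relaxed writes.  The
       writes do not conflict, so no dependence forces agreement; making
       either write strict would restore a single agreed order.           */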
Synchronization Operations

- UPC has both global and pairwise synchronization
- Beyond their synchronization properties, these operations also have memory model implications:
  - Locks: upc_lock acts as a strict read; upc_unlock acts as a strict write
  - Barriers (which may be split-phase): upc_notify (begin barrier) acts as a strict write; upc_wait (end of barrier) acts as a strict read; upc_barrier = upc_notify; upc_wait
- (More technical detail in the definitions as to the variable being read/written)
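A sketch of how these rules make the usual lock idiom safe (counter and lk are my names; upc_lock_t and upc_all_lock_alloc are from the UPC library):

    #include <upc.h>
    #include <stdio.h>

    shared int counter;                         /* protected by lk; starts at 0 */

    int main(void) {
        upc_lock_t *lk = upc_all_lock_alloc();  /* collective lock allocation   */

        upc_lock(lk);          /* strict read: the relaxed accesses below cannot
                                  appear to begin before the lock is held       */
        counter = counter + 1; /* relaxed read and write, but race-free         */
        upc_unlock(lk);        /* strict write: the update above completes
                                  before the next thread can acquire the lock   */

        upc_barrier;           /* = upc_notify (strict write) + upc_wait (strict read) */
        if (MYTHREAD == 0)
            printf("counter = %d\n", counter);  /* every increment is visible   */
        return 0;
    }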
Properties of UPC Consistency

- A program containing only strict operations is sequentially consistent
- A program that produces only race-free executions is sequentially consistent
- A UPC consistent execution of a program is race-free if, for all threads t and all enabling orderings <_t, for every potential race: if m_1 <_t m_2 then ∃ synchronization operations o_1, o_2 such that
  - m_1 <_t o_1 <_t o_2 <_t m_2, and
  - Thread(o_1) = Thread(m_1) and Thread(o_2) = Thread(m_2), and
  - either o_1 is a upc_notify and o_2 is a upc_wait, or o_1 is a upc_unlock and o_2 is a upc_lock on the same lock variable
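An illustrative race-free pattern under this definition (a sketch; buf is a hypothetical shared array): thread 0's upc_notify plays the role of o_1 and each reader's upc_wait plays o_2, separating the conflicting accesses m_1 and m_2.

    shared double buf[THREADS];   /* one element per thread */

    if (MYTHREAD == 0)
        buf[0] = 3.14;        /* m1: relaxed write by thread 0            */
    upc_notify;               /* on thread 0 this is o1, with m1 <_t o1   */
    upc_wait;                 /* on a reader this is o2, with o1 <_t o2   */
    if (MYTHREAD != 0) {
        double v = buf[0];    /* m2: relaxed read with o2 <_t m2, so the
                                 race-free condition is satisfied and
                                 v must be 3.14                           */
        (void)v;
    }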
Alternative Models

- As specified, two relaxed writes to the same location may be viewed in different orders by different processors
  - Nothing forces eventual consistency (though it is likely in implementations)
  - This could be added at barrier points at least, though so far that looks ad hoc
- Adding directionality to strict reads/writes seems reasonable:
  - Strict reads "fence" the operations that follow
  - Strict writes "fence" the operations that precede
  - This is a simple replacement for the StrictOnThreads definition
  - It supports user-defined synchronization primitives built from strict operations
Future Plans

- Show that various implementations satisfy this spec:
  - Use of non-blocking writes for relaxed writes, with a write fence/sync at strict points
  - Compiler-inserted prefetching of relaxed reads
  - Compiler-inserted "message vectorization" to aggregate a set of small operations into one larger one
  - A software caching implementation with cache flushes at strict points
- Develop an operational model and show equivalence (or at least that it implements the spec)
- Define the data unit of atomicity: the fundamental unit of interleaving, data tearing, conflicts
Conclusions

Behavioral specifications:
- Are relatively concise
- Are not intended for most end users, who would see only the "properties" part
- Avoid reference to implementation-specific notions, and are likely to constrain implementations less than operational specs

UPC:
- Has a user-controlled consistency model at the language level
- The language model need not match that of the underlying machine: it may be stronger (by inserting fences) or weaker (by reordering operations at compile time)
- This approach seems to be acceptable within the high-end programming community (there is also evidence for this in the MPI-2 spec)
Backup Slides
Communication Support Today

- Potential performance advantage for fine-grained, one-sided programs
- Potential productivity advantage for irregular applications
Hardware Limitations to Software Innovation

[Figure: software send overhead for 8-byte messages over time]
- Send overhead is not improving much over time (even in absolute terms)
Example: Berkeley UPC Compiler

- Compiler based on Open64
  - Multiple front ends, including gcc
  - Intermediate form called WHIRL
- Current focus on the C back end; IA64 is possible in the future
- UPC runtime handles pointer representation and shared/distributed memory
- Communication in GASNet: portable and language-independent
- Optimizing transformations
[Diagram: UPC source lowered through higher WHIRL and lower WHIRL, then emitted either as C + runtime or as assembly (IA64, MIPS, ...) + runtime]
Research Opportunities

- Compiler analysis and optimizations
  - Recognize local accesses and avoid runtime checks/storage
- Communication and memory optimizations
  - Separate get/put initiation from synchronization (prefetch)
  - Message aggregation (fine-grained to bulk), tiling, and caching
- Language design
  - Dynamic parallelism for load balance
  - Multiscale parallelism: express parallelism at all levels
  - Linguistic support for unstructured and sparse data structures
  - Annotations, types, pragmas for correctness and performance
- Higher-level languages
  - Parallel Matlab or parallelizing Matlab compilers
  - Domain-specific parallelism