RMA Considerations for MPI-3.1 (or MPI-3 Errata)
Minor wording correction
- In our semantics and correctness section, the rule for the SEPARATE memory model states that a PUT/ACCUMULATE must not access a target window once a local update or a PUT/GET to an overlapping window has started (pg. 455, lines 32-39).
- Recommendation: change this to "... to the same or an overlapping window has started". The rule applies to the same window as well, not only to overlapping windows.
Minor wording correction (contd.)
- The standard currently states, "A put or accumulate must not access a target window once a load/store update or a put ... ".
- The term "load/store update" has caused considerable confusion, even within the RMA working group, for example on the slide we put together on compatibility between load, store, put, get, accumulate, etc.
- Recommendation: it should say "load/store accesses". That is, simultaneous loads and PUTs are not allowed in the SEPARATE model.
MPI_MODE_NOLOAD
- We currently have an assertion MPI_MODE_NOSTORE.
- Recommendation: add MPI_MODE_NOLOAD.
- Optimization for the SEPARATE memory model:
  - Logically, when the epoch starts, the public window is copied into the private window; when the epoch ends, the private window is copied into the public window (this would be the actual implementation for software-managed caches).
  - MPI_MODE_NOSTORE allows the MPI implementation to skip the copy at the end of the epoch.
  - MPI_MODE_NOLOAD + MPI_MODE_NOSTORE would allow the MPI implementation to also skip the copy at the start of the epoch.
- Alternative: we actually only need an MPI_MODE_NOLOADSTORE, but MPI_MODE_NOLOAD is cleaner and symmetric.
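The copy-in/copy-out picture can be sketched as a toy model in plain C. This is only an illustration of the logical behavior described on the slide, not how any MPI implementation works internally; the type `toy_win`, the flags `TOY_NOSTORE` and `TOY_NOLOAD`, and both functions are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <string.h>

#define WIN_LEN 4

/* Hypothetical flags for this sketch; they are not the MPI constants. */
enum { TOY_NOSTORE = 1, TOY_NOLOAD = 2 };

typedef struct {
    int public_copy[WIN_LEN];   /* what remote PUT/GET operations see */
    int private_copy[WIN_LEN];  /* what local load/store operations see */
    int copies_done;            /* number of full-window copies performed */
} toy_win;

/* Epoch start: refresh the private copy from the public copy.
 * Skipped only when the process asserts neither loads nor stores. */
void toy_epoch_start(toy_win *w, int assertion) {
    assert(w != NULL);
    if ((assertion & TOY_NOLOAD) && (assertion & TOY_NOSTORE)) {
        return;
    }
    memcpy(w->private_copy, w->public_copy, sizeof w->private_copy);
    w->copies_done++;
}

/* Epoch end: publish the private copy to the public copy.
 * Skipped when the process asserts that it did not store. */
void toy_epoch_end(toy_win *w, int assertion) {
    assert(w != NULL);
    if (assertion & TOY_NOSTORE) {
        return;
    }
    memcpy(w->public_copy, w->private_copy, sizeof w->public_copy);
    w->copies_done++;
}
```

In this whole-window model, `TOY_NOLOAD` alone cannot skip the start-of-epoch copy: if stores are still allowed, the end-of-epoch copy publishes the entire private copy, so it has to be current first. That is one way to read why the slide pairs the two assertions.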
Info argument for req_mem_model
- Currently, the user cannot request a specific memory model; the MPI implementation has to decide on its own.
- If the hardware does not provide cache coherence, the implementation is expected to return MPI_WIN_SEPARATE.
- If the hardware does provide cache coherence, the choice of memory model is not straightforward:
  - MPI_WIN_UNIFIED makes interleaving load/store with PUT/GET more straightforward for non-overlapping memory regions.
  - MPI_WIN_SEPARATE allows the MPI implementation to maintain a separate public window region (e.g., it can be allocated in symmetric address space or shared memory space).
  - MPI_WIN_SEPARATE might be sufficient for the application if it does not plan to do load/store operations.
- Recommendation: add an info argument req_mem_model to window creation and define UNIFIED > SEPARATE (the provided value can already be queried as an attribute).
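One possible reading of the proposal can be sketched in plain C. Everything here is an assumption made for illustration: the value strings "unified" and "separate", the enum `toy_model`, and the functions `toy_select_model` and `toy_satisfies` are hypothetical, and the slide does not specify how an implementation should resolve a request.

```c
#include <string.h>

/* Ordered so that UNIFIED > SEPARATE, as the recommendation suggests. */
typedef enum { TOY_SEPARATE = 0, TOY_UNIFIED = 1 } toy_model;

/* Pick the model to provide, given hardware coherence and the value of
 * the hypothetical req_mem_model info key (NULL when the key is unset). */
toy_model toy_select_model(int hw_coherent, const char *req_value) {
    if (!hw_coherent) {
        return TOY_SEPARATE;  /* UNIFIED cannot be offered without coherence */
    }
    if (req_value != NULL && strcmp(req_value, "separate") == 0) {
        return TOY_SEPARATE;  /* the application opted for the weaker model */
    }
    return TOY_UNIFIED;       /* default on coherent hardware in this sketch */
}

/* Under the ordering UNIFIED > SEPARATE, a provided model satisfies a
 * request when it is at least as strong as the requested one. */
int toy_satisfies(toy_model requested, toy_model provided) {
    return provided >= requested;
}
```

An application that requested "unified" would still need to query the provided model, since on non-coherent hardware this sketch falls back to `TOY_SEPARATE`.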
MPI_WIN_SYNC clarification
- The current definition of MPI_WIN_SYNC states that it synchronizes the public and private memories of a window.
- This leads us to believe that it is not required for the UNIFIED memory model.
- Consider this example:

    P0                      P1
    MPI_Win_lockall         MPI_Win_lockall
                            MPI_Put(P0, X, 1);
                            MPI_Win_flush();
    MPI_Barrier             MPI_Barrier
    Y = X;

- P0 cannot do a load operation from “X” unless all MPI operations (MPI_Barrier in this example) do a memory barrier.
- Recommendation: clarify that MPI_WIN_SYNC is still required in the UNIFIED memory model, for overlapping memory accesses.
Type-safe MPI Windows
- The disp_unit has odd semantics: it is really meant to indicate what datatype the application will be using.
  - It causes confusion with the “same_size” and “same_disp_unit” hints, which make little sense on heterogeneous systems.
  - It gives MPI no type-checking capability when the same memory location is accessed as “int” or “float” and the two are the same size.
- Recommendation: provide type-safe MPI window creation routines:
  - MPI_Win_create_type, MPI_Win_allocate_type, MPI_Win_allocate_shared_type.
  - MPI operations are restricted to the window base type or datatypes created out of this type (the base type can be derived as well).
  - Still backward compatible with the current model if MPI_BYTE is used.
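The kind of check a typed window would enable can be sketched in plain C. This is a toy model, not the proposed API: `toy_type`, `toy_typed_win`, and `toy_access_ok` are hypothetical names, derived datatypes are left out, and the 4-byte element sizes are assumed for the sketch.

```c
#include <stddef.h>

typedef enum { TOY_BYTE, TOY_INT, TOY_FLOAT } toy_type;

typedef struct {
    toy_type base_type;  /* type fixed at window creation */
    size_t count;        /* window length in units of base_type */
} toy_typed_win;

/* Element sizes assumed for this sketch: int and float are both 4 bytes,
 * which is exactly the case a disp_unit cannot tell apart. */
static size_t toy_type_size(toy_type t) {
    return (t == TOY_BYTE) ? 1 : 4;
}

/* Return 1 if accessing n elements of access_type at displacement disp
 * is permitted on window w, and 0 otherwise. */
int toy_access_ok(const toy_typed_win *w, toy_type access_type,
                  size_t disp, size_t n) {
    size_t unit = 1;  /* size of one accessed element, in window units */
    if (w->base_type == TOY_BYTE) {
        /* Byte windows accept any type, as in the current model. */
        unit = toy_type_size(access_type);
    } else if (access_type != w->base_type) {
        return 0;     /* type mismatch that a disp_unit alone cannot catch */
    }
    if (disp > w->count) {
        return 0;
    }
    return n <= (w->count - disp) / unit;
}
```

A window created with `TOY_BYTE` behaves like an untyped window in this sketch, which mirrors the backward-compatibility point about MPI_BYTE.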
Load vs. Store distinction in the SEPARATE model
- The MPI standard does not distinguish load and store operations with respect to window semantics.
- In the SEPARATE model, when the epoch is closed, the private memory is copied into the public memory.
  - When the user passes the MPI_MODE_NOSTORE assertion, the MPI implementation is not required to do the copy, but might still copy.
- MPI-3 states that once you start doing load/store operations on a window, you cannot do PUT/GET operations within the same epoch.
  - This is valid in general, but too restrictive when the MPI implementation is not actually doing the copy.
- Recommendation: allow the application to use more relaxed semantics when it is not doing STORE operations.
- Possible solution: if we update the window creation prototypes for type-safe windows, we can consider adding a new parameter to define this.
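The difference between the current rule and the proposed relaxation can be sketched as a small per-epoch checker in plain C. The names `toy_epoch`, `toy_load`, `toy_store`, and `toy_put_get_allowed` are hypothetical, and the `relaxed` flag stands in for whatever mechanism (for example a new window creation parameter) would eventually select the relaxed semantics.

```c
/* Tracks the local accesses made to a window during one epoch. */
typedef struct {
    int loads;
    int stores;
} toy_epoch;

void toy_load(toy_epoch *e)  { e->loads++; }
void toy_store(toy_epoch *e) { e->stores++; }

/* Return 1 if a PUT/GET may still be issued in this epoch.
 * relaxed == 0: rule as stated on the slide, any prior load or store
 *               rules out PUT/GET for the rest of the epoch.
 * relaxed != 0: proposed behavior, only a prior store rules it out. */
int toy_put_get_allowed(const toy_epoch *e, int relaxed) {
    if (e->stores > 0) {
        return 0;
    }
    if (e->loads > 0) {
        return relaxed ? 1 : 0;
    }
    return 1;
}
```

With only loads recorded, the strict rule rejects a later PUT/GET while the relaxed rule accepts it; once a store is recorded, both reject it.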