MYO: Shared Memory Programming Model for Intel® MIC Ravi Ganapathi

1 MYO: Shared Memory Programming Model for Intel® MIC Ravi Ganapathi

2 MYO – Shared Programming Model
Allocate shared virtual memory (SVM) on Xeon® or MIC.
Access the allocated memory on both Xeon® and MIC.
Address pointers in SVM are valid on both Xeon® and MIC.
Seamlessly share complex data structures.
All data transfers between the Xeon® host and MIC are implicit.
Remote procedures can be called in either direction:
  Xeon® host to MIC
  MIC to Xeon® host
Note: MYO APIs are used to allocate SVM.

3 MYO – Shared Programming Model
[Figure: a data structure in MYO shared virtual memory, visible from both the Xeon® virtual address space and the MIC virtual address space.]
Note: MYO APIs are used to allocate SVM.

4 Offload Model (Non-Shared)
Allocate memory on the host.
Explicit data transfer between host and device:
  Transfer data with knowledge of input/output.
  Explicit data synchronization.
Marshal and un-marshal data:
  Convert data to sequential form for transfer.
  Ex: balanced tree to array and vice versa.
Note: Marshaling and un-marshaling means converting a complex data structure to sequential form, transferring it, and then recreating the structure on the other side.
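The tree-to-array example above can be sketched in plain C. This is only an illustration of the marshaling work the offload model pushes onto the programmer; the `Node` type, the `NIL_MARK` sentinel, and all function names are hypothetical and do not come from the slides or from any offload API. Keys are assumed to be non-negative so that the sentinel is unambiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pointer-based tree node (not from the slides). */
typedef struct Node {
    int key;                      /* assumed non-negative */
    struct Node *left, *right;
} Node;

#define NIL_MARK (-1)             /* written where a child link is NULL */

/* Slots needed in the flat buffer: one per node plus one per NULL link. */
static size_t flat_size(const Node *n) {
    return n ? 1 + flat_size(n->left) + flat_size(n->right) : 1;
}

/* Marshal: pre-order walk that writes keys into out[]; returns slots used. */
static size_t marshal(const Node *n, int *out) {
    if (!n) { out[0] = NIL_MARK; return 1; }
    out[0] = n->key;
    size_t used = 1;
    used += marshal(n->left, out + used);
    used += marshal(n->right, out + used);
    return used;
}

/* Un-marshal: rebuild the tree with fresh pointers on the receiving side. */
static Node *unmarshal(const int *in, size_t *pos) {
    int v = in[(*pos)++];
    if (v == NIL_MARK) return NULL;
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->key = v;
    n->left = unmarshal(in, pos);
    n->right = unmarshal(in, pos);
    return n;
}

/* Release a tree built by unmarshal(). */
static void free_tree(Node *n) {
    if (!n) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

The pointers stored in the original tree are meaningless on the other side, which is why the structure has to be flattened and rebuilt. With shared virtual memory, where pointers are valid on both sides, none of this code would be needed.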

5 CPU-GPU/MIC Interaction: Offload Model
[Figure: the CPU and the GPU/MIC have separate memory spaces. The programmer serializes the data structure into a push buffer on the CPU, the buffer is transferred to the GPU/MIC, and the programmer recreates the structure there.]
True of CUDA, OpenCL, DX, etc.
Several person-months of coding for new features.

6 Why MYO? Focus on the problem, not on implementation
MYO hides programming complexity from users.
Seamless sharing of complex pointer-containing data structures.
Implicit pipelining: overlap of communication and computation.
Shared memory programming for CPU and MIC.
Adjacent bytes can be modified in parallel by different nodes.
Support for global shared-memory synchronization primitives.

7 Why MYO? Applications are not just compute kernels
Real apps consist of serial and parallel sections:
  Execute the sequential code on the host.
  Run parallel sections on MIC.
Seamlessly switch execution between host and MIC.
Implicit data transfers between compute domains:
  The default is to transfer data on demand.
  Some overlap of communication and computation.

8 MYO Program Visualization

9 How does MYO work?
MYO uses the standard release consistency model for data transfer.
Release point: all prior stores are guaranteed to be globally visible. Think of it as local stores being flushed out.
Acquire point: all stores that are globally visible are seen locally. Think of it as syncing local memory with the global store.
The release consistency model has two update modes; MYO adds a third:
  Lazy update: update memory on demand.
  Eager update: update on acquire; transfer the entire data.
  Hybrid update: MYO-specific protocol for optimal performance.
Hybrid update is the default mode in MYO and appears to be similar to lazy update. On average, hybrid update is seen to give the best performance on most workloads.
Note: The next two slides require explanation to the customer. The visualizations of hybrid and lazy are the same except for runtime optimizations on data granularity and on reducing page faults.
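The release and acquire points described above can be modeled in a few lines of plain C. This is a toy simulation under stated assumptions, not MYO code: `ToyNode`, `global_store`, and the `toy_*` functions are invented for illustration, and the "global store" is just an array in the same process.

```c
#include <assert.h>

#define NWORDS 2                   /* word 0 plays "a", word 1 plays "b" */

static int global_store[NWORDS];   /* values that are globally visible */

typedef struct {
    int local[NWORDS];             /* this node's private copy */
    int dirty[NWORDS];             /* 1 if stored locally since the last release */
} ToyNode;

/* A local store: visible only to this node until it releases. */
static void toy_store(ToyNode *n, int i, int v) {
    assert(i >= 0 && i < NWORDS);
    n->local[i] = v;
    n->dirty[i] = 1;
}

/* Release point: flush all prior local stores so they become globally visible. */
static void toy_release(ToyNode *n) {
    for (int i = 0; i < NWORDS; i++) {
        if (n->dirty[i]) {
            global_store[i] = n->local[i];
            n->dirty[i] = 0;
        }
    }
}

/* Acquire point: bring globally visible values into the local copy.
   Words with unreleased local stores are left alone in this toy model. */
static void toy_acquire(ToyNode *n) {
    for (int i = 0; i < NWORDS; i++) {
        if (!n->dirty[i]) n->local[i] = global_store[i];
    }
}
```

In this model a store made on one node is invisible to the other until the writer releases and the reader acquires, which is the ordering the slide describes. The sketch copies every word at acquire, so it behaves like an eager update; a lazy variant would defer the copy until the word is first touched.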

10 Our mode example (Lazy Update)
Host code (a and b are in page Pi; c is in page Pk):
  a = 10;
  b = 1;
  c = b;
  mic_foo();  // remote call
  tmp = a;
MIC code:
  mic_foo() {
    tmp = b + 20;
    a = tmp;
  }
[Figure: timeline of pages Pi and Pk across the host, the directory, the PCI aperture, and the MIC, with page faults, release points, and an acquire point marked. Pages are labeled clean or dirty, and page protection is NonAccess, ReadOnly, or Writeable. Pi goes from a:0 b:0 to a:10 b:1 on the host and to a:21 b:1 after mic_foo; a diff of Pi is sent back. Pk holds c:0, then c:1.]
Shared virtual memory (ours):
  Weak release consistency: handover of shared data using memory sync (acquire/release).
  Shared address space access detection: page fault exception, virtual memory protection.
  Multiple-reader and multiple-writer.
  Update on demand.
Note: An AP/SP concept is needed for multi-reader or multi-writer in lrb. For example, one thread updates the page from the CPU (the page must be set to writable) while another thread wants to write to the same page (this must trigger a page fault to record the dirty bit). Atomic operations also cannot be guaranteed. We use SHMEM (shmat) to map two virtual addresses (AP/SP) to the same physical address. The update operation uses SP (writable) to update the page, while other working threads are protected by page faults in the AP space.
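The "Diff" label in the figure suggests that only the changed bytes of a dirty page are sent at release. One common way to get such a diff in software shared-memory systems is to keep a clean copy of the page (often called a twin) and compare against it. The sketch below assumes that approach; the slides do not confirm that MYO works this way, and `DiffEntry`, `make_diff`, and `apply_diff` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 16   /* tiny page for illustration; real pages are much larger */

/* One changed byte: where it is and what it became. */
typedef struct {
    size_t offset;
    unsigned char value;
} DiffEntry;

/* Compare a modified page against its clean twin and record changed bytes.
   out[] must have room for PAGE_BYTES entries. Returns the entry count. */
static size_t make_diff(const unsigned char *twin,
                        const unsigned char *page,
                        DiffEntry *out) {
    size_t n = 0;
    for (size_t i = 0; i < PAGE_BYTES; i++) {
        if (twin[i] != page[i]) {
            out[n].offset = i;
            out[n].value = page[i];
            n++;
        }
    }
    return n;
}

/* Apply a diff to another copy of the same page. */
static void apply_diff(unsigned char *page, const DiffEntry *d, size_t n) {
    for (size_t i = 0; i < n; i++) {
        assert(d[i].offset < PAGE_BYTES);
        page[d[i].offset] = d[i].value;
    }
}
```

Because each writer ships only the bytes it changed, two nodes that modify adjacent bytes of the same page can both have their updates merged into the shared copy, which is one way to support multiple writers per page.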

11 Our mode example (Eager Update)
Host code (a and b are in page Pi; c is in page Pk):
  a = 10;
  b = 1;
  c = b;
  mic_foo();  // remote call
  tmp = a;
MIC code:
  mic_foo() {
    tmp = b + 20;
    a = tmp;
  }
[Figure: timeline of pages Pi and Pk across the host, the directory, the PCI aperture, and the MIC, with page faults, release points, and an acquire point marked. Pages are labeled clean or dirty, and page protection is NonAccess, ReadOnly, or Writeable. Pi goes from a:0 b:0 to a:10 b:1 on the host and to a:21 b:1 after mic_foo; a diff of Pi is sent back. Pk holds c:0, then c:1.]
Shared virtual memory (ours):
  Weak release consistency: handover of shared data using memory sync (acquire/release).
  Shared address space access detection: page fault exception, virtual memory protection.
  Multiple-reader and multiple-writer.
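The difference between the lazy and eager examples can be made concrete by counting page transfers in a toy model of the a/b/c program. Everything below is an illustrative assumption rather than MYO code: two tiny "pages", a host that marks pages dirty on write, and a device whose `dev_read` stands in for the page-fault handler.

```c
#include <assert.h>
#include <string.h>

#define NPAGES 2
#define PAGE_INTS 2
enum { PI = 0, PK = 1 };            /* Pi holds a and b, Pk holds c */

static int host_mem[NPAGES][PAGE_INTS];
static int host_dirty[NPAGES];      /* pages written by the host */

typedef struct {
    int data[NPAGES][PAGE_INTS];
    int valid[NPAGES];              /* 0 means the next access "faults" */
    int transfers;                  /* pages copied from the host */
} ToyDevice;

static void host_write(int page, int slot, int v) {
    host_mem[page][slot] = v;
    host_dirty[page] = 1;
}

static void fetch_page(ToyDevice *d, int page) {
    memcpy(d->data[page], host_mem[page], sizeof host_mem[page]);
    d->valid[page] = 1;
    d->transfers++;
}

/* Lazy: acquire only invalidates; data moves when a page is first touched. */
static void acquire_lazy(ToyDevice *d) {
    for (int p = 0; p < NPAGES; p++)
        if (host_dirty[p]) d->valid[p] = 0;
}

/* Eager: acquire copies every page the host changed, touched or not. */
static void acquire_eager(ToyDevice *d) {
    for (int p = 0; p < NPAGES; p++)
        if (host_dirty[p]) fetch_page(d, p);
}

/* Stand-in for the page-fault path: fetch on first access to an invalid page. */
static int dev_read(ToyDevice *d, int page, int slot) {
    if (!d->valid[page]) fetch_page(d, page);
    return d->data[page][slot];
}

/* Runs the slide's program and returns how many pages were transferred. */
static int run_example(int eager) {
    ToyDevice dev;
    memset(&dev, 0, sizeof dev);
    memset(host_mem, 0, sizeof host_mem);
    memset(host_dirty, 0, sizeof host_dirty);
    dev.valid[PI] = dev.valid[PK] = 1;      /* device starts with clean zero pages */

    host_write(PI, 0, 10);                  /* a = 10 */
    host_write(PI, 1, 1);                   /* b = 1  */
    host_write(PK, 0, host_mem[PI][1]);     /* c = b  */

    if (eager) acquire_eager(&dev); else acquire_lazy(&dev);

    int tmp = dev_read(&dev, PI, 1) + 20;   /* mic_foo: tmp = b + 20 */
    assert(tmp == 21);
    return dev.transfers;
}
```

In this model `run_example(0)` moves only Pi, because `mic_foo` never touches c, while `run_example(1)` moves both Pi and Pk at acquire. Lazy update saves bandwidth at the cost of a fault on first access; eager update pays for the transfer up front.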

12 Memory Management
Arena-based VSM:
  An arena is the minimum unit for the memory consistency protocol.
  When no arena is specified, MYO uses the default arena.
  Users can create arenas and perform malloc/free inside an arena.
CPU and MIC share VSM:
  MYO-managed virtual memory is identical on the host Xeon® and MIC.
  Physical memory is distinct and managed locally by the host Xeon® and MIC.
Note: Memory management details are hidden from the programmer. Shared memory properties can be set at the arena level.
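For readers unfamiliar with arenas, the idea of allocating inside a user-created region can be sketched with a minimal bump allocator. This is not the MYO arena API, which the slides do not show; `ToyArena` and the `toy_arena_*` functions are hypothetical, and unlike the arenas described above this sketch has no per-allocation free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_ALIGN 16   /* assumed alignment for every allocation */

/* A contiguous region that hands out memory front to back. */
typedef struct {
    unsigned char *base;
    size_t capacity;
    size_t used;
} ToyArena;

/* Reserve the backing region. Returns 0 on success, -1 on failure. */
static int toy_arena_create(ToyArena *a, size_t capacity) {
    a->base = malloc(capacity);
    a->capacity = capacity;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Allocate size bytes from the arena, or return NULL if it does not fit. */
static void *toy_arena_alloc(ToyArena *a, size_t size) {
    size_t start = (a->used + TOY_ALIGN - 1) / TOY_ALIGN * TOY_ALIGN;
    if (start > a->capacity || size > a->capacity - start) return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Release the whole arena at once. */
static void toy_arena_destroy(ToyArena *a) {
    free(a->base);
    a->base = NULL;
    a->capacity = a->used = 0;
}
```

Grouping allocations into one region is what makes it practical to attach properties, such as a consistency mode, to the region as a whole rather than to each object.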

13 MYO Programming

14 MYO Program Structure
Include files:
  myo.h
  myoimpl.h
User initialization routine:
  myoiUserInit {
    Register shared variables and remote functions here.
  }
Main routine:
  main {
    Initialize MYO
    ….
    Call RPC
    Finalize MYO
  }
Note: Feel free to skip the programming section and walk through actual code if that works.

15 myoiUserInit
Register remote functions and shared variables.
myoiRemoteFuncRegister: registers a function so that it can be called from remote nodes.
  Ex: myoiRemoteFuncRegister((MyoiRemoteFuncType) helloworld, "helloworld");
myoiVarRegister: registers a shared variable.
  Ex: myoiVarRegister((void *)&buffer, "buffer");
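Registering a function under a string name implies some kind of name-to-function table that the remote side consults when a call arrives. The following sketch shows that idea in plain C. It is not how MYO implements registration, which the slides do not describe; `toy_register`, `toy_call`, and `ToyRemoteFunc` are hypothetical stand-ins, and the call is dispatched locally instead of across PCIe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*ToyRemoteFunc)(void *arg);

#define TOY_MAX_FUNCS 8

static struct {
    const char *name;
    ToyRemoteFunc fn;
} toy_table[TOY_MAX_FUNCS];
static size_t toy_count;

/* Record fn under name. Returns 0 on success, -1 if the table is full. */
static int toy_register(ToyRemoteFunc fn, const char *name) {
    if (toy_count == TOY_MAX_FUNCS) return -1;
    toy_table[toy_count].name = name;
    toy_table[toy_count].fn = fn;
    toy_count++;
    return 0;
}

/* Look up name and invoke the function. Returns -1 if nothing is registered. */
static int toy_call(const char *name, void *arg) {
    for (size_t i = 0; i < toy_count; i++) {
        if (strcmp(toy_table[i].name, name) == 0) {
            toy_table[i].fn(arg);
            return 0;
        }
    }
    return -1;
}

/* Example function to register; it overwrites the caller's buffer. */
static void helloworld(void *arg) {
    strcpy((char *)arg, "world hello");
}
```

A caller would register once at start-up, for example `toy_register(helloworld, "helloworld")`, and later invoke by name with `toy_call("helloworld", buffer)`. Using names rather than raw function pointers matters because the same function can live at different addresses in the host and device binaries.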

16 Main routine
Initialize MYO:
  myoiLibInit(NULL, (void*)&myoiUserInit);
Call remote function:
  myoRelease();  // release before RPC
  rpcHandle = myoiRemoteCall((char*)"FunctionName", argument, 0);
  myoiGetResult(rpcHandle);  // wait for RPC completion
  myoAcquire();  // acquire after RPC return
Finalize MYO:
  myoiLibFini();

17 Hello World Client (Host)
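#include <stdio.h>
#include <string.h>
#include "myo.h"
#include "myoimpl.h"
#define MAXWORLD 50  /* not defined on the slide; value assumed here */
extern MyoError myoiUserInit(void);
int main(int argc, char *argv[]) {
    MyoiRFuncCallHandle rpcHandle;
    char *buffer;
    char temp[50] = "hello world";
    myoiLibInit(NULL, (void *)&myoiUserInit);
    /* Allocate the buffer in shared virtual memory so the MIC can access it. */
    buffer = (char *)myoSharedMalloc(sizeof(char) * MAXWORLD * 2);
    strcpy(buffer, temp);
    printf("%s\n", buffer);
    myoRelease();  /* release before RPC */
    rpcHandle = myoiRemoteCall((char *)"kernel", buffer, 0);  /* function to run on MIC */
    myoiGetResult(rpcHandle);  /* wait for RPC completion */
    myoAcquire();  /* acquire after RPC return */
    myoiLibFini();
    return 0;
}
/* The host registers nothing in this example. */
MyoError myoiUserInit(void) {
    MyoError ret = MYO_SUCCESS;
    return ret;
}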

18 Hello World Service (Device)
#include <string.h>
#include "myo.h"
#include "myoimpl.h"
extern MyoError myoiUserInit(void);
/* Remote function, invoked from the host by name. */
void kernel(char *buffer) {
    myoAcquire();  /* acquire in service */
    char temp[50] = "world hello";
    strcpy(buffer, temp);
    myoRelease();  /* release before return */
}
int main(int argc, char *argv[]) {
    myoiLibInit(NULL, (void *)&myoiUserInit);
    myoiLibFini();
    return 0;
}
/* Init function, generated by the translator or written by hand. */
MyoError myoiUserInit(void) {
    MyoError ret = MYO_SUCCESS;
    ret = myoiRemoteFuncRegister((MyoiRemoteFuncType)kernel, "kernel");
    return ret;
}

19 Case Study: Volume Rendering (VR)
Volume rendering displays a 2D projection of a 3D discretely sampled data set.
It favors MYO because of its complex data structure.
Sequential code: initialize the complex data structure.
Parallel code: volume rendering kernel.
Optimal solution: initialize on the host, run the render on MIC.

20 Case Study: Volume Rendering (VR)
Execution mode    Time (seconds)
Offload           (not shown)
MYO (Hybrid)      6.7675
MYO (Eager)       6.8815
MYO (Lazy)        (not shown)
Offload:
  Read input from file on host.
  Transfer data to device.
  Init the data structure.
  Execute the parallel rendering algorithm.
MYO:
  Read input from file on host and init the data structure.
  Transfer data to device on demand (hybrid and lazy) or on acquire (eager).
  Execute the parallel rendering algorithm.

21 Summary: MYO Usage Model
We expect MYO to have the following advantages for MIC:
Easy programming:
  Shared memory programming model.
  No data marshaling.
Competitive performance with message passing:
  No data marshaling overhead.
  VSM with release consistency.
  Fine-grained control of data sharing.

22 Q&A

23 Backup

24 Case Study: Volume Rendering (VR)
Time (seconds)
           Offload      MYO (hybrid)  MYO (eager)  MYO (lazy)   Native
trial 1    12.003       6.804         6.869        10.877       5.82
trial 2    11.157       6.708         6.969        10.734       5.88
trial 3    10.769       6.863         6.885        10.736       5.87
trial 4    11.124       6.648         6.857        10.828       6.05
trial 5    10.968       6.722         6.856        10.842       6.06
trial 6    10.651       6.765         6.881        10.634       6.03
trial 7    10.876       6.911         6.878        10.725       (not shown)
trial 8    10.782       6.719         (not shown)  10.905       5.96
averages   (not shown)  6.7675        6.8815       (not shown)  (not shown)
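The hybrid average can be checked directly from the eight trial values. The short C sketch below recomputes it and also derives an offload-to-hybrid ratio; the array names and the `mean` helper are illustrative, and the assignment of values to columns assumes that the one value missing from trial 8 is the eager time.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n values; n must be positive. */
static double mean(const double *x, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Trial times in seconds, copied from the table. */
static const double hybrid_trials[8] = {
    6.804, 6.708, 6.863, 6.648, 6.722, 6.765, 6.911, 6.719
};
static const double offload_trials[8] = {
    12.003, 11.157, 10.769, 11.124, 10.968, 10.651, 10.876, 10.782
};

/* How many times longer the offload runs took than the hybrid runs, on average. */
static double offload_over_hybrid(void) {
    return mean(offload_trials, 8) / mean(hybrid_trials, 8);
}
```

The hybrid mean reproduces the 6.7675 s figure reported in the averages row, which also confirms which column that average belongs to.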

25 MYO Runtime Modules
[Figure: layered diagram of the MYO runtime.]
MYO Runtime APIs (myo.h)
Modules: Initialization, Mem Consistency, Mem mgmt, RPC module, Global Sync. Primitives
Communication module
Windows/Linux SCIF APIs

