Automating and Optimizing Data Transfers for Many-core Coprocessors. Student: Bin Ren; Advisor: Gagan Agrawal; NEC Intern Mentors: Nishkam Ravi, Yi Yang.

CSE Poster Event 2014. Other Collaborators: Min Feng, Srimat Chakradhar.

Motivation
- Many-core coprocessors commonly have their own memory hierarchy, e.g. the Intel Xeon Phi (MIC) and NVIDIA GPUs.
- A typical system pairs an 8-core CPU host with a 60+-core coprocessor over PCIe, so data must be transferred between the two memories.

The Goal of This Work
Design dynamic (runtime library) and static (code transformation) methods that automatically manage and optimize data communication between the CPU and many-core coprocessors for multi-dimensional arrays and multi-level pointers:
- Minimize redundant data transfers
- Utilize Direct Memory Access (DMA)
- Reduce memory allocation on the coprocessor
- Preserve compiler optimizations on the coprocessor

State of the Art: Programming Challenges
Two current approaches manage the data transfer between CPU and coprocessor, trading productivity against performance:
- Explicit message passing. Pros: fast. Cons: users must manage the data offload themselves, and only bit-wise copyable data can be transferred (a multi-level pointer such as int **b, unlike a flat int *a, cannot simply be copied).
- Virtual shared memory (MYO). Pros: easy programming, supports complex structures. Cons: slow, due to unnecessary synchronization.

Programming with LEO/OpenAcc (explicit offload, after changing the malloc site to split the pointer arrays from the real data):

    // Change the malloc site to separate the pointer arrays (A, B, C)
    // from the dense data buffers (A_data, B_data, C_data).
    #pragma offload target(mic) in(A_data, B_data, C_data : length(m*n) REUSE)
    {}
    #pragma offload target(mic) nocopy(A, B, C : length(n) ALLOC)
    { /* connect A, B, C with A_data, B_data, C_data */ }
    #pragma offload target(mic) nocopy(A, B, C : length(n))
    {
      #pragma omp parallel for private(i)
      for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
          A[i][j] = B[i][j] * C[i][j];
    }
    #pragma offload target(mic) out(A_data : length(m*n) FREE)

Contributions
- Study the performance bottlenecks of the state-of-the-art dynamic and static methods.
- Design two novel heap-linearization algorithms and an optimized MYO method to improve communication performance.
- Implement a static source-to-source code transformer based on the Partial Linearization with Pointer Reset design.
- Evaluate and analyze both dynamic and static approaches on multiple benchmarks to show the efficacy of the Partial Linearization with Pointer Reset method.

Static Mechanism and Runtime Mechanism
Partial Linearization with Pointer Reset handles high-dimensional array addition as well as structs and non-unit-stride accesses. Its benefits:
- No modification to the access site, which preserves potential compiler optimizations and reduces the possibility of introducing bugs.
- Reduced communication overhead: only the linearized data is transferred, and the number of offloads is minimized.
- DMA utilization: the linearized data sits in a dense memory buffer.

Experimental Results
- CPU: Intel Xeon E5-2609 (8-core); Coprocessor: Intel Xeon Phi (61-core, MIC); Compiler: ICC.
- The poster summarizes the benchmarks and compares: speedup and data-transfer size of the static methods (linearization) over OPT-Runtime (MYO); of OPT-Runtime over the baseline runtime (MYO); of OPT-Complete Linearization over Complete Linearization for MG; and the speedup of the best CPU+MIC configuration over the 8-core CPU.

