Revisiting Kirchhoff Migration on GPUs Rice Oil & Gas HPC Workshop


1 Revisiting Kirchhoff Migration on GPUs 2015 Rice Oil & Gas HPC Workshop
This is a follow-up to Scott's talk on Kirchhoff migration using GPUs, given here in 2008. I did an internship at Hess, and that is when we revisited the algorithms with certain goals in mind. In this talk I will try to convey those goals and our approaches to meeting them. Rajesh Gandham, Rice University & Hess Corporation (intern) Thomas Cullison, Hess Corporation Scott Morton, Hess Corporation

2 Seismic Experiment - Keep this brief and make note of the rays, because they will be a good lead-in to the next slide. There is little we can do if people in the audience do not understand seismic. Here is a cartoon of a seismic survey experiment. Energy is sent from a controlled source on the earth's surface into the earth. Some of that energy reflects from interfaces in the earth's structure, and the reflected energy is recorded at receivers located on the surface. Using these recordings, we hope to estimate the structure of the subsurface. There are several imaging approaches for this.

3 Kirchhoff Migration
[Diagram: a source and a receiver on the surface, an image point at depth; traveltimes Ts and Tr along the rays; two-way time t = Ts + Tr; a data-trace sample is added to the image trace.]
Keep it brief. Talk about the two-way traveltime from the source to the receiver. Mention that there is an offset between the source and receiver, and that these offsets can vary both in magnitude and in vector direction; this and the following line are seeds for extended imaging. Mention that the rays enter and exit at various angles in the subsurface; this will be the seed for a brief mention of the subsurface-angle gather kernel test. One method that allows us to estimate the structure is Kirchhoff migration. In this method, waves are represented by rays. In this picture, a ray takes time Ts to travel from the source to the image point, and the reflected ray takes time Tr to reach a receiver. Data recorded at the receiver are put at all the places in the subsurface they could have come from. Adding one receiver datum to one image point is one migration contribution. Formally, the method is an integration of the recorded data across all source-receiver pairs on the surface. It is interesting to look at how the image depends on the source-receiver configuration, the angle of reflection, etc. Those questions require us to produce general migration images in a higher-dimensional space.
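The per-trace update described above can be sketched as follows. This is an illustrative serial version, not the production kernel; all names (migrate_trace, ts_table, tr_table) are hypothetical, and nearest-sample lookup stands in for the real interpolation.

```python
def migrate_trace(image, trace, ts_table, tr_table, dt):
    """Add one data trace's contribution to every image point.

    ts_table[i] / tr_table[i]: precomputed traveltimes (e.g. from ray
    tracing) from the source / receiver to image point i.
    dt: time sample interval of the data trace.
    """
    for i in range(len(image)):
        t = ts_table[i] + tr_table[i]     # two-way traveltime t = Ts + Tr
        k = int(round(t / dt))            # nearest data sample
        if 0 <= k < len(trace):
            image[i] += trace[k]          # one migration contribution
    return image
```

Summing this update over all recorded traces realizes the surface integral mentioned in the transcript.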

4 Seismic Image Keep this brief This is an example of a seismic image
Several derived features and properties are shown: some structural features and some physical properties that require extended imaging. Here is an example image obtained from migration. The image shows several features and properties that require general image gathers.

5 Project Goals Hardware portability General image gathers
Improve migration performance The production implementations use the CUDA programming model and therefore take advantage of NVIDIA GPUs only. We wanted to look at the feasibility of portable implementations in a production environment. We also wanted to extend the imaging capabilities of the production implementations, and to improve the overall performance of migration in the context of the newly added features.

6 Hardware portability Project Goals General image gathers
Improve migration performance First, let's see how we achieved portability.

7 OCCA for Portability The frontend, backend, middle-man, and other features of OCCA can be discussed here. Be sure to point out the other hardware accelerators, especially the AMD GPUs. We adopted the OCCA programming model for portability. David discussed OCCA in a previous talk. We write kernels in the OCCA kernel language; these kernels can be compiled against several threading models, so they can run on CPUs, GPUs, and other accelerators that support those models. The kernels can be called from C, C++, etc. In this work we focused mainly on GPUs, using CUDA and OpenCL, because GPUs currently provide the best performance per cost.

8 Portability Results Ported and tested the production kernel from CUDA to OCCA in ~3 weeks. Tested and verified kernel results on CPU and GPU. Tested a production migration on GPUs. Performance: greater kernel performance because of runtime compilation; kernels still need some tuning for best performance on various architectures. Be sure to point out that we tested the OCCA production kernel on the CPU and compared/verified its results against the GPU. We DID test a production migration of the OCCA-ported production kernel. Remember that the performance increase ranged from as low as 5-10% on the 1.3 architecture to just under 2x on the K10. OCCA uses the GPU model of parallelism, i.e., thread blocks and threads, so it was easy to port the CUDA kernels to the OCCA language. We were able to port and test the kernels in about three weeks, verified the GPU results against the CPU, and tested a production migration with the ported kernels. We did not lose any performance with the ported kernels; in fact, we saw some gains, because kernels are compiled at run time in OCCA, which lets the compiler do more optimization for each individual run.

9 General image gathers Project Goals Hardware portability
Improve migration performance We got hardware flexibility with OCCA. Next, we wanted to look at flexibility in the physics.

10 Standard Kirchhoff Imaging
Pre-compute coarse travel times from surface locations to image points. 4D surface integral through a 5D data set to a 3D image. Computational complexity: NI ~ 10^10, number of output image points; ND ~ 10^9, number of input data traces; f ~ 10, number of cycles/image-point/trace; f·NI·ND ~ 10^20 cycles, ~ 10^3 CPU core years. Take some time with this slide. Point out the equation and the computational complexity carefully. You'll want to flip back and forth between this slide and the next slide AFTER you have carefully explained the next slide. In Kirchhoff imaging, we first compute the travel times from surface locations to image locations on a coarse grid with ray tracing; we won't focus on that in this talk. We compute a 3D image with a 4D surface integral of the 5D data set. One such image computation involves generating an image at 10 billion points using a billion traces, which needs roughly 1000 CPU core years. So we have to focus on efficiency.
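As a sanity check on the cycle count above, the estimate is just a product of the slide's numbers divided by an assumed core clock rate (the 3 GHz figure is an assumption, not stated in the talk):

```python
def migration_cost_core_years(n_image=1e10, n_traces=1e9,
                              cycles_per_contribution=10, core_hz=3e9):
    """Back-of-envelope cost of a standard Kirchhoff migration."""
    total_cycles = cycles_per_contribution * n_image * n_traces  # ~1e20
    seconds_per_year = 3600 * 24 * 365
    return total_cycles / (core_hz * seconds_per_year)
```

With the slide's numbers this gives on the order of 10^3 CPU core years, consistent with the estimate above.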

11 Kirchhoff Gather Imaging
Pre-compute coarse travel times from surface locations to image points. 4D surface integral through a 5D data set to a 4D/5D image. Image gathers: offset, offset vector tile (OVT), subsurface angles, etc. Take your time on this slide. Point out the 'g' function variable. Go over the changes in the computational complexity. Switch back and forth between this slide and the previous slide, point out the 'g' function variable and the differences in the computational complexity, and also point out that the total number of compute cycles is the same. When we want to produce general gather images, we do similar computations. The difference is that instead of generating the image on a 3D volume, we generate a 4D, 5D, or even higher-dimensional image. The additional dimension can be surface offset, surface offset vector, or reflection angle. The current production version is capable of generating offset gathers. Here we encode all the gathers in a single index 'g' and write one kernel that generates the image in this general framework.
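One way to picture the single gather index 'g' is a surface-offset binning. The scheme below (fixed-width bins over absolute source-receiver offset, with hypothetical bin width and count) is purely illustrative, not the production encoding:

```python
def offset_gather_bin(src_x, rcv_x, bin_width=100.0, n_bins=32):
    """Map a source-receiver pair to a gather index g (illustrative)."""
    offset = abs(rcv_x - src_x)                   # surface offset in meters
    return min(int(offset // bin_width), n_bins - 1)

# The output image then gains one extra dimension indexed by g, e.g.
# image[g][iz][iy][ix] += contribution
```

An OVT or angle gather would use a different mapping to g, but the kernel structure stays the same.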

12 Improve migration performance
Project Goals Hardware portability General image gathers Improve migration performance We got portability and flexibility in imaging. Next, we looked at improving the overall performance in the new general framework. First, let's step back and see what we did in the current production code.

13 Previous Approach Define tasks that can be run in parallel
Task should be small enough to fit on a GPU Copying data to and from the GPU is expensive Global memory access can be a bottleneck The image computation is divided into independent tasks that can run in parallel. Each task is small enough to fit on a GPU. The major challenges: moving data to and from the device is expensive, so we need to minimize data movement; and accessing global memory on the device is also expensive.

14 Previous Approach The production code is designed with all of these in mind. Here I am showing a cartoon of the image volume. The coarse cells represent the travel-time locations; the fine cells are where the actual image is computed.

15 ↔ Previous Approach This is an animated slide
The point of this slide is that the production kernel was specifically tuned for NVIDIA GPUs and CUDA. We know that some tuning will be needed to get the best performance on various accelerators, but we want to reduce that time. The columns of travel-time cells are mapped onto the grid of thread blocks; one thread block does the computations for one column of travel-time cells. In each thread block, a two-dimensional set of threads traverses the layers and computes the image on the fine grid.
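The decomposition just described can be sketched serially as follows, with hypothetical sizes: each coarse travel-time column becomes one "thread block", and a 2D tile of "threads" inside it walks down through the layers of the fine grid.

```python
def enumerate_work(n_cols_x, n_cols_y, n_layers, tile_x, tile_y):
    """List every (block, layer, thread) work item in the mapping
    described above (serial emulation; sizes are illustrative)."""
    items = []
    for bx in range(n_cols_x):            # grid of thread blocks:
        for by in range(n_cols_y):        #   one block per coarse column
            for iz in range(n_layers):    # block traverses the layers
                for tx in range(tile_x):  # 2D threads within the block
                    for ty in range(tile_y):
                        items.append((bx, by, iz, tx, ty))
    return items
```

On a GPU the (bx, by) and (tx, ty) loops run concurrently; only the layer traversal is sequential within a block.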

16 Previous Approach Overview
~32 traces per task Big image block per task One gather bin per task Pre-filter the data Resample the data CUDA programming model Take your time on this slide. You need to be clear and precise about the bottom two comments and how they tie in above; for example, people will be shocked that there was only a ~3x improvement when you show the next slide. To summarize: we load a small number of traces and produce a large image volume, with only one gather bin per task. The data is pre-filtered for anti-aliasing and re-sampled for accurate interpolation before being moved to the device; both of these expand the amount of data loaded onto the device. We use the CUDA programming model on NVIDIA GPUs. Now we have the additional constraint of producing the image on multiple gather bins.

17 New Approach for Performance
Define tasks that can be run in parallel Task should be small enough to fit on a GPU Copying data to and from the GPU is expensive Global memory access can be a bottleneck Improve FLOPs/load This is a key transition slide; take your time and set up the audience and the rest of the talk. Improved FLOPs/load is KEY, and the points above it reinforce this. In the new design we took into account improving the overall FLOPs per unit of data loaded onto the device.

18 New Approach parallelism in data traces parallelism in image volume
Take some time with this slide. The goal is to get the audience to free their minds of the structure you just showed them and move into the more abstract realm of general domain decomposition, to take advantage of the parallel model used by OCCA. FIX the CUBE slide: get rid of the x-y-z axes and, if possible, break the lateral columns into halves or thirds. There is parallelism in the input data traces on the surface, and there is parallelism in the output image volume. We need to figure out the surface patch size and the image chunk to assign to each task, with the goal of increasing FLOPs/load. For this we did some parameter analysis of the efficiency.

19 Parameter Analysis Take some time with this slide
You need to tell a story here. Though the diagram is taken from our analysis, here it is more of a cartoon tool to help people understand. In this slide we show one example of the analysis for a particular set of parameters. On the left, we vary the extent of the 4D surface data traces; on the right, we vary the extent of the 3D image volume. We measure performance as migration contributions per byte of data loaded. There is a wide range of input data and output image volume sizes with near-optimal efficiency. We see a sudden decline in efficiency where we run out of memory on the GPU. The production version operates in the regime of very few traces and a large output image block.

20 New Approach Implementation for general image gathers
Offset gather, OVT gather, reflection angle gather, etc. Produce a small chunk of image quickly. See imaging results as each task finishes. Improve the overall performance on new hardware; the production code was optimized for CUDA and NVIDIA GPUs in 2008/2009. Develop portable software: hardware architectures change relatively fast, there are several vendors and varieties of accelerators, and several parallel models for various languages. Be sure to point out the GENERAL gather framework. Be sure to point out that the goal is a kernel that is portable across generations, vendors (i.e., AMD and NVIDIA), and accelerator types (Phi, FPGA, GPU, CPU). We implemented the new general image gathers in a single kernel. We load more traces and a small image volume, so we produce small image chunks quickly. We designed the new kernels for the NVIDIA 3.0 architecture, while the production implementations were designed for the 1.3 architecture. We kept all of this under the portable umbrella of OCCA.

21 Production vs New Approach
~32 traces per task → ~200k traces per task; big image block per task → small image block per task; one gather bin per task → multiple gather bins per task; pre-filter the data → filter on the fly; resample the data → interpolate on the fly; CUDA programming model → OCCA programming approach. To summarize the differences between the production approach and the new approach: we assign many traces per task and produce a small image with multiple gather bins per task. We can do this because we did away with pre-filtering and resampling before moving the data onto the device; instead, we filter and interpolate on the fly. We essentially traded memory overhead for computation, which results in a very high FLOPs/byte. We also moved from CUDA to the more general OCCA programming model. Avoiding pre-filtering (for anti-aliasing) and resampling reduces memory overhead and increases the number of computations per migration contribution, giving greater FLOPs/byte.
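The "interpolate on the fly" trade can be sketched as a linear interpolation performed per migration contribution, instead of resampling whole traces before the transfer. This is only an illustration of the memory-for-compute trade; the production code's interpolation and anti-alias filtering are more involved.

```python
def sample_on_the_fly(trace, t, dt):
    """Linearly interpolate a trace sample at time t (sample rate dt),
    computed per contribution instead of via pre-resampled data."""
    x = t / dt                       # fractional sample index
    k = int(x)
    if k < 0 or k + 1 >= len(trace):
        return 0.0                   # outside the recorded window
    w = x - k                        # interpolation weight
    return (1.0 - w) * trace[k] + w * trace[k + 1]
```

Each call costs a few extra FLOPs per contribution but removes the need to ship an upsampled copy of the data to the device.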

22 Production vs New Approach
Million migration contributions/s. We compare the performance of the production and new approaches, fixing the input data size and varying the output image volume. The new approach becomes efficient even for very small image volumes, while the production approach needs to generate a large image volume to be efficient. Note that the time for pre-filtering and resampling is not included here. This alone does not give enough information: the new approach is designed for small image volumes while the production one is designed for large volumes, so we will only know the overall benefit from a full production migration. Output image block length (m); input traces fixed at ~177,000 (NVIDIA K10); pre-filtering and resampling of the production code are not included.

23 New Approach Outcomes Improved production performance best guess (~2X)
Generalized gather kernel framework Portable implementation Tested and verified CPU vs GPU results Tested and compared OpenCL vs CUDA Performance on AMD GPUs is similar to NVIDIA GPUs Our initial guess is that the new approach will be about 2x faster. We now have a more general gather kernel and a portable implementation. We compared the results on CPUs with GPUs, and tested with both OpenCL and CUDA.

24 New Approach Kernel: NVIDIA vs AMD
Million migration contributions/s. We run the same test again, keeping the input size fixed and varying the output image volume, on NVIDIA and AMD GPUs. On the K40, CUDA does a little better than OpenCL. Interestingly, the AMD Tahiti card performs better than the NVIDIA K40, but we can only say so confidently after running a full production migration. Output image volume size in meters; number of input traces ~177,000.

25 Project Goals Review Future Work Hardware portability
General image gathers Improve migration performance We started our project with three goals, and we are indeed going in the right direction. Now we have to integrate the new kernel into production. In the future we would like to do more testing on various accelerators through the flexibility of OCCA, and explore using hybrid architectures (CPU & GPU) for migration. Future Work: finish integrating the new kernel into production; more testing on various accelerators; explore mixed-architecture migrations.

26 Acknowledgements Hess Corporation CAAM @ Rice University Tim Warburton
David Medina

