Published by Bartholomew Simpson. Modified over 9 years ago.
1
The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators
Marcin Pietroń 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC, 17-18.09.2015, Kraków
2
Agenda
- GPU acceleration
- Code analysis and instrumentation
- Experiments
- Results
- Conclusion and future work
3
GPUs as modern hardware accelerators
- Computing power (over 1 Tflops)
- Availability
- High parallelism (SIMT architecture)
- High-level programming tools (CUDA, OpenCL)
4
GPU hardware accelerators
Many algorithms from different domains have been implemented on GPUs:
- Linear algebra (e.g. cuBLAS, CULA)
- Deep learning, neural networks, machine learning algorithms (e.g. SVM)
- Computational intelligence (e.g. genetic and memetic algorithms)
- Data and text mining
5
Code analysis
- Implementation should be preceded by appropriate analysis
- Analysis can be automated
- Static analysis for finding hidden parallelism (Banerjee test, Range Test, Omega Test) and for data reuse and distribution
- Profiling as dynamic analysis
6
Byte code analysis and instrumentation
- Byte code analysis just in time
- Appropriate instrumentation for profiling and static analysis
- Results of analysis and profiling can be used for implementation
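One way to get hold of byte code as classes are loaded is the standard java.lang.instrument agent mechanism. The sketch below is an assumption for illustration, not the system described in the slides: the class name LoadTimeObserver is hypothetical, and it only records which classes pass through it. A real profiler would return rewritten byte code from transform instead of null.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.ArrayList;
import java.util.List;

// Hypothetical load-time observer (illustrative name, not from the slides).
final class LoadTimeObserver implements ClassFileTransformer {
    // Internal names (e.g. "demo/Foo") of the classes seen so far.
    final List<String> seen = new ArrayList<>();

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        if (isClassFile(classfileBuffer)) {
            synchronized (seen) {
                seen.add(className);
            }
        }
        // Returning null tells the JVM to keep the class bytes unchanged.
        return null;
    }

    // A class file starts with the magic number 0xCAFEBABE.
    static boolean isClassFile(byte[] b) {
        return b != null && b.length >= 4
                && (b[0] & 0xFF) == 0xCA && (b[1] & 0xFF) == 0xFE
                && (b[2] & 0xFF) == 0xBA && (b[3] & 0xFF) == 0xBE;
    }

    // Entry point used when the JVM is started with -javaagent:<agent jar>;
    // the jar manifest must name this class in its Premain-Class attribute.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoadTimeObserver());
    }
}
```

Keeping the transformer free of side effects on the class bytes makes it safe to attach first and add the actual rewriting step later.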
7
System architecture
8
Byte code instrumentation
- instrumenting array data read instructions,
- instrumenting array data write instructions,
- instrumenting array data read and write instructions to count the number of accesses and their standard deviation,
- instrumenting reads and writes of single variables to count the number of accesses.
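The counters listed above could be kept by a small runtime helper that the instrumented code calls on every array access. The following is a minimal sketch under stated assumptions: the class AccessStats and its method names are hypothetical, and the standard deviation is assumed to be taken over the accessed indices, which the slide does not specify.

```java
// Hypothetical per-array statistics object (names are illustrative).
final class AccessStats {
    long reads;
    long writes;

    // Running mean and sum of squared deviations (Welford's method),
    // computed over the indices that were accessed.
    private long n;
    private double mean;
    private double m2;

    // Called by instrumented code next to an array load (e.g. iaload).
    void onRead(int index) {
        reads++;
        update(index);
    }

    // Called by instrumented code next to an array store (e.g. iastore).
    void onWrite(int index) {
        writes++;
        update(index);
    }

    private void update(int index) {
        n++;
        double delta = index - mean;
        mean += delta / n;
        m2 += delta * (index - mean);
    }

    long accesses() {
        return reads + writes;
    }

    // Population standard deviation of the accessed indices.
    double stdDev() {
        return n == 0 ? 0.0 : Math.sqrt(m2 / n);
    }
}
```

Welford's update is used here so that the statistics need constant memory per array, regardless of how many accesses are observed.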
9
Byte code instrumentation

Original loop:

    for (int i = 1; i < 100; i++) {
        test_1[i] = 100;
        test_2[i] = test_1[i-1] + 10;
    }

Instrumented loop:

    for (int i = 1; i < 100; i++) {
        test_1[i] = 100;
        test_1_mon[i] = i;               // monitor array: iteration that wrote test_1[i]
        test_2[i] = test_1[i-1] + 10;
        if (test_1_mon[i-1] < i) {
            dist_vectors[i-1] = i-test_1_mon[i];
        }
    }

Byte code of the instrumented loop (excerpt):

    27: iconst_1
    28: istore %6
    30: iload %6
    32: bipush 100
    34: if_icmpge #93
    37: aload_1
    38: iload %6
    40: bipush 100
    42: iastore
    43: aload_3
    44: iload %6
    46: iload %6
    48: iastore
    ...
    70: if_icmpge #87
    73: aload %5
    75: iload %6
    77: iconst_1
    78: isub
    79: iload %6
    81: aload_3
    82: iload %6
    84: iaload
    85: isub
    86: iastore
    87: iinc %6 1
    90: goto #30
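The idea behind the monitor array can be sketched as a self-contained program that records, for every read, how many iterations earlier the element was written. This is an illustration under assumptions, not a copy of the slide's code: the names DistanceProbe, lastWriter and NOT_WRITTEN are hypothetical, and the distance is computed here as the reading iteration minus the writing iteration of the element that is read.

```java
import java.util.Arrays;

// Hypothetical probe that measures loop-carried dependence distances.
final class DistanceProbe {
    static final int NOT_WRITTEN = -1;

    // Runs the loop from the slide for i in [1, n) and returns dist, where
    // dist[i] is the number of iterations between the write of test1[i-1]
    // and its read in iteration i, or NOT_WRITTEN if it was never written.
    static int[] run(int n) {
        int[] test1 = new int[n];
        int[] test2 = new int[n];
        int[] lastWriter = new int[n]; // iteration that last wrote test1[k]
        int[] dist = new int[n];
        Arrays.fill(lastWriter, NOT_WRITTEN);
        Arrays.fill(dist, NOT_WRITTEN);

        for (int i = 1; i < n; i++) {
            test1[i] = 100;
            lastWriter[i] = i;               // instrumentation of the write
            test2[i] = test1[i - 1] + 10;
            if (lastWriter[i - 1] != NOT_WRITTEN) {
                dist[i] = i - lastWriter[i - 1]; // instrumentation of the read
            }
        }
        return dist;
    }
}
```

For this loop every recorded distance is 1, which indicates a dependence carried from each iteration to the next; the first iteration reads an element that no iteration wrote, so no distance is recorded for it.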
10
GPU implementation rules
- data reused between iterations (that is, between threads) should be transferred to shared memory,
- data reused only within a single iteration should be transferred to local memory (registers),
- data that is reused, read-only, and accessed irregularly should be allocated in texture memory,
11
GPU implementation rules
- common constant values used by threads should be written to constant memory,
- data that is accessed only once, but not in a coalesced pattern, should first be transferred as a group in a coalesced manner to shared memory and then read from there for further computing.
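The placement rules can be read as a small decision procedure over the properties measured by the profiler. The sketch below is one possible encoding under assumptions: the enum and class names are hypothetical, the order in which the rules are tested is a choice made here (the slides give no priority), and the GLOBAL fallback is added for data that matches no rule.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical properties a profiler could report for one array or variable.
enum Trait {
    REUSED_ACROSS_ITERATIONS, // reused between iterations (threads)
    REUSED_IN_ONE_ITERATION,  // reused only inside a single iteration
    READ_ONLY,
    IRREGULAR_ACCESS,
    COMMON_CONSTANT,          // same constant value used by all threads
    UNCOALESCED               // access pattern is not coalesced
}

// Target memory chosen for the data.
enum Placement {
    SHARED, REGISTERS, TEXTURE, CONSTANT, SHARED_VIA_COALESCED_COPY, GLOBAL
}

final class PlacementRules {
    // Rule order is an assumption of this sketch.
    static Placement choose(Set<Trait> t) {
        boolean across = t.contains(Trait.REUSED_ACROSS_ITERATIONS);
        boolean inOne = t.contains(Trait.REUSED_IN_ONE_ITERATION);

        if (t.contains(Trait.COMMON_CONSTANT)) {
            return Placement.CONSTANT;
        }
        if ((across || inOne) && t.contains(Trait.READ_ONLY)
                && t.contains(Trait.IRREGULAR_ACCESS)) {
            return Placement.TEXTURE;
        }
        if (across) {
            return Placement.SHARED;
        }
        if (inOne) {
            return Placement.REGISTERS;
        }
        if (t.contains(Trait.UNCOALESCED)) {
            // Single access, not coalesced: stage through shared memory.
            return Placement.SHARED_VIA_COALESCED_COPY;
        }
        return Placement.GLOBAL; // fallback, not part of the slides
    }
}
```

For example, `PlacementRules.choose(EnumSet.of(Trait.REUSED_ACROSS_ITERATIONS))` selects shared memory, matching the first rule.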
12
JCuda generation
- Implementation can be done manually or in a partly automated way
- The rules generate parallel code patterns
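A code pattern produced by such a rule might be emitted as CUDA C source text from a Java template. The sketch below is an assumption for illustration only: the class KernelTemplate, the kernel layout and the parameter names are hypothetical and do not come from the slides. It generates a kernel for the read pattern out[i] = in[i-1] + c, staging the input through shared memory because neighbouring threads reuse it.

```java
// Hypothetical generator for one parallel code pattern (illustrative only).
final class KernelTemplate {
    // Returns CUDA C source for out[i] = in[i-1] + addend, for 0 < i < n.
    // The generated kernel assumes it is launched with blockDim.x == blockSize.
    static String shiftAddKernel(String name, int blockSize, int addend) {
        return String.join("\n",
            "extern \"C\" __global__ void " + name
                + "(const int *in, int *out, int n) {",
            "    // one extra slot holds the left neighbour of the block",
            "    __shared__ int tile[" + (blockSize + 1) + "];",
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;",
            "    int t = threadIdx.x + 1;",
            "    if (i < n) tile[t] = in[i];  // coalesced load",
            "    if (threadIdx.x == 0 && i > 0 && i < n) tile[0] = in[i - 1];",
            "    __syncthreads();",
            "    if (i > 0 && i < n) out[i] = tile[t - 1] + " + addend + ";",
            "}");
    }
}
```

The returned string would still have to be compiled for the GPU and launched through JCuda's bindings; that step is left out here because it needs a CUDA-capable machine.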
13
Experimental results

size of matrix    GPU time [ms]    CPU time (MKL BLAS) [ms]
256×256           0.4              6
512×512           4.3              24
1024×1024         34               158
2048×2048         285              956
4096×4096         2817990 (GPU and CPU values fused in the source)
14
Conclusions and future work
- Preceding the implementation with source code analysis helps adapt an algorithm to the GPU
- Automated generation of parallel GPU code saves a lot of time
- Based on byte code = portable
- Code generation in our system needs further optimization (memory access patterns)
15
Questions