Published by Roland Lindsey. Modified over 6 years ago.
1
Workshop at Nizhny Novgorod State University: Activity Report
Alexey Iliasov, Kyrgyz Russian Slavic University
2
Goals of the project
Research:
- implementation approaches
- applicability
- targeting real-life applications
Implement:
- a simple profiler
- an analysis tool
3
Implementation Approaches
Levels of abstraction:
- hardware level
- machine instructions level
- assembly language level
- compiler level
- source code level
- library level
4
GNU Family Compilers
- supports many languages
- supports many targets
- provides lots of optimisation techniques
- open source, available under the terms of the GPL
5
GNU Family Compilers
- machine independent
- ports exist for more than 30 platforms
- high code generation quality
- intensive optimisation
- RTL: Register Transfer Language
- reusability: ,000 lines of language- and platform-independent routines
6
GNU Family Compilers
- weird internal structure
- written in a mix of C and C++
- modularity problems
- lack of good documentation
7
GCC infrastructure
[Diagram: source → language front-end (parser) → tree optimisation → RTL → target back end (25 optimization passes + assembler generation) → binary; debug info]
8
Mudflap: C/C++ bounds checker
- based on tree transformation
- instruments the program to detect memory access errors
- tracks calls to many library functions
- provides replacements for common C library functions
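The core idea behind a bounds checker of this kind can be sketched as a registry: instrumented code registers each object (base address and size) and checks every access against it. This is a toy model under stated assumptions; `register_object` and `check_access` are hypothetical names, not Mudflap's runtime API, and a real checker uses a far faster lookup than this linear scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object registry: a toy stand-in for the bookkeeping a
   bounds checker performs; names and layout are illustrative only. */
#define MAX_OBJECTS 64

struct object_entry {
    uintptr_t base;   /* first byte of the object */
    size_t size;      /* object size in bytes */
    const char *name; /* variable name, for error reports */
};

static struct object_entry registry[MAX_OBJECTS];
static int object_count = 0;

/* Called by instrumented code when an object comes into existence. */
bool register_object(const void *ptr, size_t size, const char *name) {
    assert(ptr != NULL);
    if (object_count == MAX_OBJECTS) return false;
    registry[object_count++] = (struct object_entry){(uintptr_t)ptr, size, name};
    return true;
}

/* Called before each access: [ptr, ptr + len) is valid only if it lies
   entirely inside one registered object. */
bool check_access(const void *ptr, size_t len) {
    uintptr_t p = (uintptr_t)ptr;
    for (int i = 0; i < object_count; i++) {
        const struct object_entry *o = &registry[i];
        if (p >= o->base && len <= o->size && p - o->base <= o->size - len)
            return true;
    }
    return false;
}
```

For a registered 16-byte buffer, a 1-byte access at offset 15 passes, while a 2-byte access at offset 15 or any access at offset 16 is reported as out of bounds.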
9
Mudzzi: memory profiler for GCC
- based on the mudflap approach
Development considerations:
- high performance
- language independent
- large-scale applications
- minimization of inlined code
- multi-threading support
- online or post-mortem analysis
10
Mudzzi: memory profiler for GCC
Tracked events:
- read/write memory accesses
- object declarations
- object destructions (for stack-frame objects)
- calls to malloc, calloc, realloc, mmap and free
- timing
11
Mudzzi records format
Two record types: normal and length-prefixed (prefix, length, record)
Memory Read/Write record:
- record type: 32 bits
- access address: 32 bits
- RDTSC CPU tick value: 64 bits
- source line number: 32 bits
- base pointer address: 32 bits
- size of accessed region: 32 bits
- encoded source file and function name: 32 bits
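A minimal C sketch of the Memory Read/Write record as a 32-byte structure, assuming the field widths listed above and a 32-bit address space (as on IA-32). The struct, its field names and `serialize_record` are hypothetical, not taken from Mudzzi's sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Memory Read/Write record with the slide's field widths.
   Names are hypothetical. */
struct mpf_rw_record {
    uint32_t record_type;   /* record type */
    uint32_t address;       /* accessed address */
    uint64_t tsc;           /* RDTSC CPU tick value */
    uint32_t line;          /* source line number */
    uint32_t base;          /* base pointer address of the object */
    uint32_t size;          /* size of the accessed region */
    uint32_t location_code; /* encoded source file and function name */
};

/* 4 + 4 + 8 + 4 + 4 + 4 + 4 = 32 bytes; the 64-bit field starts at
   offset 8, so common ABIs insert no padding. */
_Static_assert(sizeof(struct mpf_rw_record) == 32, "record must be 32 bytes");

/* Copy one record into a dump buffer in host byte order;
   returns the number of bytes written. */
size_t serialize_record(const struct mpf_rw_record *rec, unsigned char *out) {
    assert(rec != NULL && out != NULL);
    memcpy(out, rec, sizeof *rec);
    return sizeof *rec;
}
```

Keeping the record fixed-size makes the dump trivially seekable: record k starts at byte 32·k if only this record type is present.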
12
Mudzzi code transformation
Original:
void foo() {
    int a = 3;
    int b[100];
    b[a] = 10;
    return;
}
Instrumented:
void foo() {
    int a = 3;
    mpf_vardecl(&a, sizeof(int), 0, "a", .., ..);
    int b[100];
    mpf_vardecl(b, sizeof(int)*100, 0, "b", .., ..);
    b[a] = 10;
    mpf_add(b+a, a, b, 1, .., ..);
    mpf_varundecl(&a, .., ..);
    mpf_varundecl(b, .., ..);
    return;
}
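To make the transformation concrete, here is a self-contained sketch of what such hooks might record. The real `mpf_*` signatures are partly elided on the slide, so this uses hypothetical stand-ins (`trace_decl`, `trace_access`, `trace_undecl`) and a logical counter in place of RDTSC; it illustrates the idea and is not Mudzzi's runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event log in the spirit of the mpf_* hooks. */
enum event_kind { EV_DECL, EV_ACCESS, EV_UNDECL };

struct event {
    enum event_kind kind;
    uintptr_t addr;     /* object or accessed address */
    size_t size;        /* object or access size in bytes */
    const char *name;   /* variable name (declarations only) */
    unsigned long tick; /* logical clock standing in for RDTSC */
};

#define LOG_CAPACITY 1024
static struct event event_log[LOG_CAPACITY];
static size_t event_count = 0;
static unsigned long clock_ticks = 0;

static void log_event(enum event_kind kind, const void *addr, size_t size,
                      const char *name) {
    assert(event_count < LOG_CAPACITY);
    event_log[event_count++] =
        (struct event){kind, (uintptr_t)addr, size, name, clock_ticks++};
}

void trace_decl(const void *addr, size_t size, const char *name) {
    log_event(EV_DECL, addr, size, name);
}
void trace_access(const void *addr, size_t size) {
    log_event(EV_ACCESS, addr, size, NULL);
}
void trace_undecl(const void *addr) {
    log_event(EV_UNDECL, addr, 0, NULL);
}

/* foo() from the slide, instrumented by hand with the hypothetical hooks. */
void foo_instrumented(void) {
    int a = 3;
    trace_decl(&a, sizeof a, "a");
    int b[100];
    trace_decl(b, sizeof b, "b");
    b[a] = 10;
    trace_access(&b[a], sizeof b[a]); /* logged after the write, as on the slide */
    trace_undecl(&a);
    trace_undecl(b);
}
```

One call to `foo_instrumented()` produces five events: two declarations, one access at `b + 3`, and two undeclarations when the stack frame goes away.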
13
Mudzzi: profiled code performance ~20% of original
14
Mudzzi: dump file size problem, the dump grows very fast
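A rough estimate shows why: assuming each memory access emits one Memory Read/Write record with the field widths given in the record format, and ignoring declaration, allocation and prefix records, the dump grows linearly with the number of accesses. The access count of 10^8 is a hypothetical example, not a measurement.

```latex
\begin{aligned}
\text{record size} &= \frac{32 + 32 + 64 + 32 + 32 + 32 + 32}{8}\ \text{bytes} = 32\ \text{bytes},\\
S_{\text{dump}} &\approx N_{\text{access}} \times 32\ \text{bytes},\\
N_{\text{access}} = 10^{8} &\;\Rightarrow\; S_{\text{dump}} \approx 3.2 \times 10^{9}\ \text{bytes} \approx 3.2\ \text{GB}.
\end{aligned}
```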
15
Visualization and analysis tool for memory profiler
16
Features overview
1. Visualization of memory profiler dump
2. Cycle detection
3. Array access analysis inside detected cycles
4. Reuse distance calculation for arrays
5. Cache hit/miss rate, analysis and explanations
17
Address/time diagram: example array access pattern
[Figure: address vs. time, for array traversal by rows and by columns]
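The two patterns in such a diagram follow from C's row-major layout. A small sketch, assuming a hypothetical 4 x 4 `int` array: traversal by rows advances the address by `sizeof(int)` per step, while traversal by columns jumps a whole row (`COLS * sizeof(int)`) per step. `traversal_offsets` is an illustrative helper, not part of the tool.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4u
#define COLS 4u

/* Byte offsets touched, in time order, when a ROWS x COLS int array is
   traversed. Row-major layout: offset(i, j) = (i * COLS + j) * sizeof(int). */
void traversal_offsets(int by_rows, size_t out[ROWS * COLS]) {
    size_t n = 0;
    for (size_t outer = 0; outer < (by_rows ? ROWS : COLS); outer++)
        for (size_t inner = 0; inner < (by_rows ? COLS : ROWS); inner++) {
            size_t i = by_rows ? outer : inner; /* row index */
            size_t j = by_rows ? inner : outer; /* column index */
            out[n++] = (i * COLS + j) * sizeof(int);
        }
}
```

Plotted against time, the row-wise offsets form one steady ramp, while the column-wise offsets form COLS interleaved ramps with a large stride.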
18
address/time diagram array access pattern
19
cache config and report
20
Blocked Matrix Multiply cache interference
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Z += X * Y, traversed in B x B blocks; etype is the element type */
void BlkMatrixMultiply (etype *X, etype *Y, etype *Z, int N, int B)
{
    int w, q, i, j, k;
    etype r;
    for (w = 0; w < N; w += B)
        for (q = 0; q < N; q += B)
            for (i = 0; i < N; i++)
                for (k = w; k < MIN (w + B, N); k++) {
                    r = *(X + i * N + k);
                    for (j = q; j < MIN (q + B, N); j++)
                        *(Z + i * N + j) += *(Y + k * N + j) * r;
                }
}
where N is the matrix size and B the block size; we use N = 128, B = 32, with arrays a[N][N], b[N][N], c[N][N].
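A self-contained harness that checks the blocked kernel against a plain triple loop, assuming `etype` is `double` (the slide does not fix the element type). The kernel copy uses `B` consistently as the block size; `blk_multiply`, `naive_multiply` and `max_blocked_error` are hypothetical helper names.

```c
#include <assert.h>
#include <stdlib.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))
typedef double etype; /* assumed element type */

/* Blocked kernel: Z += X * Y, B used as the block size throughout. */
static void blk_multiply(const etype *X, const etype *Y, etype *Z, int N, int B) {
    for (int w = 0; w < N; w += B)
        for (int q = 0; q < N; q += B)
            for (int i = 0; i < N; i++)
                for (int k = w; k < MIN(w + B, N); k++) {
                    etype r = X[i * N + k];
                    for (int j = q; j < MIN(q + B, N); j++)
                        Z[i * N + j] += Y[k * N + j] * r;
                }
}

/* Plain triple loop used as the reference result. */
static void naive_multiply(const etype *X, const etype *Y, etype *Z, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                Z[i * N + j] += X[i * N + k] * Y[k * N + j];
}

/* Runs both versions on the same inputs; returns the largest difference. */
double max_blocked_error(int N, int B) {
    size_t n = (size_t)N * (size_t)N;
    etype *X = malloc(n * sizeof *X), *Y = malloc(n * sizeof *Y);
    etype *Z1 = calloc(n, sizeof *Z1), *Z2 = calloc(n, sizeof *Z2);
    assert(X && Y && Z1 && Z2);
    for (size_t t = 0; t < n; t++) {
        X[t] = (etype)(t % 7); /* small integers keep every sum exact */
        Y[t] = (etype)(t % 5);
    }
    blk_multiply(X, Y, Z1, N, B);
    naive_multiply(X, Y, Z2, N);
    double worst = 0.0;
    for (size_t t = 0; t < n; t++) {
        double d = Z1[t] - Z2[t];
        if (d < 0) d = -d;
        if (d > worst) worst = d;
    }
    free(X); free(Y); free(Z1); free(Z2);
    return worst;
}
```

With the slide's parameters, `max_blocked_error(128, 32)` returns 0.0, because the small integer inputs keep every partial sum exact regardless of summation order; `N = 100` additionally exercises the `MIN` clipping of partial blocks.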
21
Blocked Matrix Multiply cache interference
Full view of cache utilization report
22
Blocked Matrix Multiply cache interference
VARIABLE `a' hit rate: 81%
  Replacement causes:
    `b' (0xbffe78d0:49152) replacements (62%)
    `c' (0xbffdb8d0:49152) - 57 replacements (3%)
  [callouts: interference with b; self interference]
VARIABLE `b' hit rate: 81%
  Replacement causes:
    `a' (0xbfff38d0:49152) replacements (1%)
    `c' (0xbffdb8d0:49152) replacements (3%)
VARIABLE `c' hit rate: 99%
  Replacement causes:
    `a' (0xbfff38d0:49152) replacements (7%)
    `b' (0xbffe78d0:49152) replacements (91%)
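The report shows `a`, `b` and `c` placed exactly 49152 bytes (0xC000) apart. Under an assumed cache geometry (16 KB, direct-mapped, 32-byte lines; the configuration shown on the slide is not preserved in the text), a short calculation suggests why they interfere: 49152 is a multiple of the 16 KB cache size, so element (i, j) of all three arrays maps to the same set.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache geometry, not taken from the slide. */
#define LINE_SIZE 32u  /* bytes per cache line */
#define NUM_SETS  512u /* 16 KB direct-mapped: 16384 / 32 sets */

/* Set index = (address / line size) mod number of sets. */
static uint32_t set_index(uint32_t addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
}

/* Base addresses of a, b and c as printed in the utilization report. */
static const uint32_t base_a = 0xbfff38d0u;
static const uint32_t base_b = 0xbffe78d0u;
static const uint32_t base_c = 0xbffdb8d0u;

/* The same byte offset into each array lands in one set whenever the
   bases differ by a multiple of LINE_SIZE * NUM_SETS (16384 here). */
static int same_set_for_offset(uint32_t off) {
    uint32_t sa = set_index(base_a + off);
    return sa == set_index(base_b + off) && sa == set_index(base_c + off);
}
```

Here all three bases fall into set 454, so with this geometry every a/b/c triple competes for a single line, which is consistent with the heavy cross-replacement in the report. A different cache size or associativity would change the picture.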
23
Reuse distance: number of distinct object references between two reuses
Access sequence: a d f e c e a c a a c e d e a c d f a e a c
- RD = 4 (e, c, a, d): between the two accesses to f
- RD = 4 (d, f, e, c): between the first two accesses to a
- not a time distance but an address distance measure
- closely related to hit rate for LRU/FIFO caches
- leads to an effective and easy-to-apply optimisation
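A sketch that computes reuse distance for the example sequence and checks its link to hit rate, assuming a fully associative LRU cache holding `capacity` objects: there an access hits exactly when its reuse distance is smaller than the capacity. FIFO caches do not obey this rule exactly and are not modelled; all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Reuse distance of access t: distinct objects referenced strictly between
   the previous access to trace[t] and t; -1 for a cold (first) access. */
int reuse_distance(const char *trace, int t) {
    int prev = -1;
    for (int s = t - 1; s >= 0; s--)
        if (trace[s] == trace[t]) { prev = s; break; }
    if (prev < 0) return -1;
    bool seen[256] = {false};
    int distinct = 0;
    for (int s = prev + 1; s < t; s++) {
        unsigned char obj = (unsigned char)trace[s];
        if (!seen[obj]) { seen[obj] = true; distinct++; }
    }
    return distinct;
}

/* Hits predicted by the rule "reuse with RD < capacity". */
int lru_hits_from_rd(const char *trace, int capacity) {
    int hits = 0, n = (int)strlen(trace);
    for (int t = 0; t < n; t++) {
        int rd = reuse_distance(trace, t);
        if (rd >= 0 && rd < capacity) hits++;
    }
    return hits;
}

/* Direct simulation of a fully associative LRU cache, as a cross-check. */
int lru_hits_simulated(const char *trace, int capacity) {
    assert(capacity > 0 && capacity <= 256);
    char stack[256]; /* stack[0] is the most recently used object */
    int depth = 0, hits = 0;
    for (const char *p = trace; *p; p++) {
        int pos = -1;
        for (int s = 0; s < depth; s++)
            if (stack[s] == *p) { pos = s; break; }
        if (pos >= 0) hits++;
        else pos = depth < capacity ? depth++ : capacity - 1; /* evict LRU */
        memmove(stack + 1, stack, (size_t)pos); /* shift younger entries */
        stack[0] = *p;
    }
    return hits;
}
```

For the sequence on the slide, the second access to `f` (index 17) and the second access to `a` (index 6) both have reuse distance 4. Since only five distinct objects occur, no reuse distance exceeds 4, so a 5-entry LRU cache hits on all 17 non-cold accesses.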
24
Clustering: finding groups of variables commonly used together
Access sequence: a d f e c e a c a a c e d e a c d f a e a c
[Figure: affinity matrices for variables a to f, before and after merging a and c into the cluster a-c; surviving values include row a: 5, 1, 4, 1 (columns c, d, e, f) and row a-c: 1, 3.5, 0.5 (columns d, e, f)]
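One way such a clustering step could work, as a hedged sketch: the affinity of two variables is assumed to be the number of times they are accessed back to back, and merging the most affine pair gives the new cluster the average of the two rows. Under these assumptions the code reproduces the numbers that survive on the slide (5 1 4 1 for row a, and 1 3.5 0.5 after merging a and c), but the tool's actual metric and linkage rule are not stated; `build_affinity` and `merge_closest` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define NVARS 6 /* variables a..f */

/* Assumed affinity: count of back-to-back accesses to two distinct variables. */
void build_affinity(const char *trace, double aff[NVARS][NVARS]) {
    memset(aff, 0, sizeof(double) * NVARS * NVARS);
    for (size_t t = 0; trace[t] && trace[t + 1]; t++) {
        int x = trace[t] - 'a', y = trace[t + 1] - 'a';
        assert(x >= 0 && x < NVARS && y >= 0 && y < NVARS);
        if (x != y) { aff[x][y] += 1.0; aff[y][x] += 1.0; }
    }
}

/* Merges the most affine pair (p, q) into p: p's affinity to every other
   variable becomes the average of p's and q's, then q is cleared.
   The merged pair is returned through p_out and q_out. */
void merge_closest(double aff[NVARS][NVARS], int *p_out, int *q_out) {
    int p = 0, q = 1;
    for (int x = 0; x < NVARS; x++)
        for (int y = x + 1; y < NVARS; y++)
            if (aff[x][y] > aff[p][q]) { p = x; q = y; }
    for (int z = 0; z < NVARS; z++) {
        if (z == p || z == q) continue;
        aff[p][z] = aff[z][p] = (aff[p][z] + aff[q][z]) / 2.0;
    }
    for (int z = 0; z < NVARS; z++) aff[q][z] = aff[z][q] = 0.0;
    aff[p][p] = 0.0;
    *p_out = p;
    *q_out = q;
}
```

Repeating `merge_closest` until the best remaining affinity drops below a threshold would yield groups of variables that are candidates for being placed together in memory.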
25
Results of the project: profiler implementation (as a GCC module)
Benefits:
- good analysis capabilities and binding to sources
- good performance
- ease of use
Problems:
- inefficient: full code coverage (the whole program is instrumented)
- part of another program (GCC)
26
Results of the project: applicability
- instrumentation works effectively for large-scale applications
- reasonable performance penalty
- platform/OS independent
Problems:
- lack of remote analysis
- GCC-centric
27
Results of the project: analysis tool
- visual diagrams
- cache analysis
- binding to source level
- flexible
Problems:
- poor representation for long-running large applications
- too few analysis tools
- some tests/tools get stuck on large dump files
28
Results of the project: profiling the profiler
29
Results of the project: a glance at the future
- consider DIOTA as the instrumentation basis
- implement remote analysis
- multiple specific profilers within one analysis tool
- add support for HT/SMP architectures
30
That's all. Thank you!
Iliasov Alexey
Kyrgyz Russian Slavic University, Kyrgyzstan