Workshop at Nizhny Novgorod State University
Activity Report
Alexey Iliasov
Kyrgyz Russian Slavic University
Goals of the project
Research:
- implementation approaches
- applicability
- targeting real-life applications
Implement:
- simple profiler
- analysis tool
Implementation Approaches
levels of abstraction
- hardware level
- machine instructions level
- assembly language level
- compiler level
- source code level
- library level
GNU Family Compilers
- support many languages
- support many targets
- provide many optimisation techniques
- open source
- available under the terms of the GPL
GNU Family Compilers
- machine independent
- ports exist for more than 30 platforms
- high code generation quality
- intensive optimisation
- RTL (Register Transfer Language)
- reusability
- 225,000 lines of language- and platform-independent routines
GNU Family Compilers
- weird internal structure
- written in a mix of C and C++
- modularity problems
- lack of good documentation
GCC infrastructure
[Diagram labels: source, parser, language front-end, tree optimisation, RTL, 25 optimization passes + assembler generation, target back end, debug info, binary]
Mudflap
C/C++ bounds checker
- based on tree transformation
- instruments the program to detect memory access errors
- tracks calls to many library functions
- provides replacements for common C library functions
Mudzzi
memory profiler for GCC
- based on the mudflap approach
development considerations:
- high performance
- language independent
- large-scale applications
- minimization of inlined code
- multi-threading support
- online or post-mortem analysis
Mudzzi
memory profiler for GCC
tracked events:
- read/write memory accesses
- object declarations
- object destructions (for stack-frame objects)
- calls to malloc, calloc, realloc, mmap and free
- timing
Mudzzi
records format
two record types:
- normal: prefix, record
- length prefixed: prefix, length, record
Memory Read/Write record:
- record type: 32 bits
- access address: 32 bits
- RDTSC CPU tick value: 64 bits
- source line number: 32 bits
- base pointer address: 32 bits
- size of accessed region: 32 bits
- coded source file and function name: 32 bits
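
The field list for the memory read/write record adds up to 256 bits, or 32 bytes per access. A minimal sketch of that layout in C follows; the struct, its field names and the `dump_size_bytes` helper are illustrative assumptions, not the profiler's actual definitions, and the prefix that precedes each record is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one memory read/write record.
   Field order and widths follow the slide; the names are assumptions. */
typedef struct {
    uint32_t type;      /* record type */
    uint32_t address;   /* access address */
    uint64_t ticks;     /* CPU tick value read with RDTSC */
    uint32_t line;      /* source line number */
    uint32_t base;      /* base pointer address */
    uint32_t size;      /* size of accessed region */
    uint32_t location;  /* coded source file and function name */
} mem_access_record;

/* 32 + 32 + 64 + 32 + 32 + 32 + 32 bits = 32 bytes, with no padding. */
_Static_assert(sizeof(mem_access_record) == 32,
               "record is expected to occupy 32 bytes");

/* Rough dump size for a given number of traced accesses,
   counting only the record bodies. */
uint64_t dump_size_bytes(uint64_t accesses)
{
    assert(accesses <= UINT64_MAX / sizeof(mem_access_record));
    return accesses * sizeof(mem_access_record);
}
```

Under these assumptions, `dump_size_bytes(1000000)` gives 32,000,000 bytes, so a run with a billion traced accesses would need on the order of 32 GB for access records alone.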
Mudzzi
code transformation
original:
    void foo() {
        int a = 3;
        int b[100];
        b[a] = 10;
        return;
    }
instrumented:
    void foo() {
        int a = 3;
        mpf_vardecl(&a, sizeof(int), 0, "a", .., ..);
        int b[100];
        mpf_vardecl(b, sizeof(int)*100, 0, "b", .., ..);
        b[a] = 10;
        mpf_add(b+a, a, b, 1, .., ..);
        mpf_varundecl(a, .., ..);
        mpf_varundecl(b, .., ..);
        return;
    }
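
To make the effect of the transformation concrete, here is a small self-contained sketch of the same idea: each declaration, memory write and scope exit in `foo` is followed by a call that appends an event to a log. The names `log_event`, `event_log` and `demo_foo` are hypothetical stand-ins chosen for illustration; they do not reproduce the signatures of the real `mpf_*` hooks, whose remaining arguments are elided on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds, loosely mirroring the tracked events. */
typedef enum { EV_DECL, EV_WRITE, EV_UNDECL } event_kind;

typedef struct {
    event_kind  kind;
    const void *addr;   /* address of the object or accessed element */
    size_t      size;   /* size in bytes */
} event;

enum { LOG_CAPACITY = 64 };
event  event_log[LOG_CAPACITY];
size_t event_count = 0;

/* Append one event to the in-memory log. */
void log_event(event_kind kind, const void *addr, size_t size)
{
    assert(event_count < LOG_CAPACITY);
    event_log[event_count].kind = kind;
    event_log[event_count].addr = addr;
    event_log[event_count].size = size;
    event_count++;
}

/* Hand-instrumented version of foo(): the compiler pass would
   insert the log_event calls automatically. */
void demo_foo(void)
{
    int a = 3;
    log_event(EV_DECL, &a, sizeof a);
    int b[100];
    log_event(EV_DECL, b, sizeof b);
    b[a] = 10;
    log_event(EV_WRITE, &b[a], sizeof b[a]);
    log_event(EV_UNDECL, &a, sizeof a);
    log_event(EV_UNDECL, b, sizeof b);
}
```

Calling `demo_foo()` once leaves five events in the log: two declarations, one write and two scope exits. A real profiler would write records to a dump file instead of a fixed-size array.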
Mudzzi
profiled code performance: ~20% of original
Mudzzi
dump file size problem: grows very fast
Visualization and analysis tool for memory profiler
features overview
1. Visualization of memory profiler dump
2. Cycle detection
3. Array access analysis inside detected cycles
4. Reuse distance calculation for arrays
5. Cache hit/miss rate, analysis and explanations
address/time diagram
example array access pattern: by rows, by columns
[Figure: accessed addresses plotted against time]
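
The two patterns differ only in the order in which the same addresses are visited. As a rough illustration, assuming a square n x n array stored in row-major order (as C does), the byte offset of the k-th access can be computed as below; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of the k-th access when the array is walked row by row:
   consecutive accesses are elem bytes apart. */
size_t offset_by_rows(size_t k, size_t n, size_t elem)
{
    assert(n > 0);
    size_t row = k / n, col = k % n;
    return (row * n + col) * elem;
}

/* Offset of the k-th access when the array is walked column by column:
   consecutive accesses within a column are n * elem bytes apart. */
size_t offset_by_columns(size_t k, size_t n, size_t elem)
{
    assert(n > 0);
    size_t col = k / n, row = k % n;
    return (row * n + col) * elem;
}
```

For a 4 x 4 array of 4-byte elements, the second access lands at offset 4 when walking by rows and at offset 16 when walking by columns, which is why the column walk appears as a steep sawtooth on an address/time plot.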
address/time diagram array access pattern
cache config and report
Blocked Matrix Multiply
cache interference
    /* etype is the matrix element type; MIN (x, y) yields the smaller argument */
    void BlkMatrixMultiply (etype *X, etype *Y, etype *Z, int N, int B)
    {
      int w, q, i, j, k;
      etype r;
      for (w = 0; w < N; w += B)            /* blocks along k */
        for (q = 0; q < N; q += B)          /* blocks along j */
          for (i = 0; i < N; i++)
            for (k = w; k < MIN (w + B, N); k++) {
              r = *(X + i * N + k);
              for (j = q; j < MIN (q + B, N); j++)
                *(Z + i * N + j) += *(Y + k * N + j) * r;
            }
    }
where N is the matrix size and B is the block size; we use N = 128, B = 32, and the arrays are a[N][N], b[N][N], c[N][N]
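
One way to gain confidence in a blocked loop nest like this is to compare it with a plain triple loop on a small input. The sketch below assumes `etype` is `double` and defines `MIN` as a macro; `naive_matrix_multiply` and `blocked_matches_naive` are hypothetical helpers added for the check.

```c
#include <assert.h>

typedef double etype;                       /* assumed element type */
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Blocked version: accumulates X * Y into Z, all N x N, row-major. */
void blk_matrix_multiply(const etype *X, const etype *Y, etype *Z,
                         int N, int B)
{
    assert(N > 0 && B > 0);
    for (int w = 0; w < N; w += B)
        for (int q = 0; q < N; q += B)
            for (int i = 0; i < N; i++)
                for (int k = w; k < MIN(w + B, N); k++) {
                    etype r = X[i * N + k];
                    for (int j = q; j < MIN(q + B, N); j++)
                        Z[i * N + j] += Y[k * N + j] * r;
                }
}

/* Reference version: straightforward triple loop. */
void naive_matrix_multiply(const etype *X, const etype *Y, etype *Z, int N)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                Z[i * N + j] += X[i * N + k] * Y[k * N + j];
}

/* Returns 1 when both versions agree on a small 8 x 8 test case.
   Inputs are small integers, so the sums are exact in double. */
int blocked_matches_naive(int B)
{
    enum { N = 8 };
    etype X[N * N], Y[N * N], Zb[N * N] = {0}, Zn[N * N] = {0};
    for (int i = 0; i < N * N; i++) {
        X[i] = i % 7;
        Y[i] = (i * 3) % 5;
    }
    blk_matrix_multiply(X, Y, Zb, N, B);
    naive_matrix_multiply(X, Y, Zn, N);
    for (int i = 0; i < N * N; i++)
        if (Zb[i] != Zn[i])
            return 0;
    return 1;
}
```

The check is worth running with a block size that does not divide the matrix size (for example 3) and one larger than the matrix (for example 32), since those are the cases the `MIN` bounds are there to handle.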
Blocked Matrix Multiply cache interference Full view of cache utilization report
Blocked Matrix Multiply
cache interference
VARIABLE `a' hit rate: 81%
  Replacement causes:
    `b' (0xbffe78d0:49152) - 961 replacements (62%)
    `c' (0xbffdb8d0:49152) - 57 replacements (3%)
  [slide annotations: interference with b; self interference]
VARIABLE `b' hit rate: 81%
  Replacement causes:
    `a' (0xbfff38d0:49152) - 637 replacements (1%)
    `c' (0xbffdb8d0:49152) - 1607 replacements (3%)
VARIABLE `c' hit rate: 99%
  Replacement causes:
    `a' (0xbfff38d0:49152) - 130 replacements (7%)
    `b' (0xbffe78d0:49152) - 1579 replacements (91%)
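
A report of this kind can be produced by replaying the access trace through a cache model and recording, for every evicted line, which variable owned it and which variable caused the eviction. The following is a minimal sketch under stated assumptions: a direct-mapped cache with 64 sets and 32-byte lines, and integer variable ids. The geometry and all names are illustrative; the slides do not state the cache configuration used by the tool.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MAX_VARS = 8, NUM_SETS = 64, LINE_SIZE = 32 };  /* assumed geometry */

typedef struct {
    int      valid[NUM_SETS];
    uint32_t tag[NUM_SETS];
    int      owner[NUM_SETS];                 /* variable that loaded the line */
    unsigned hits[MAX_VARS];
    unsigned misses[MAX_VARS];
    unsigned evicted_by[MAX_VARS][MAX_VARS];  /* [victim][cause] */
} cache_sim;

void cache_init(cache_sim *c)
{
    memset(c, 0, sizeof *c);
}

/* Replay one access to address addr made on behalf of variable var. */
void cache_access(cache_sim *c, uint32_t addr, int var)
{
    assert(var >= 0 && var < MAX_VARS);
    uint32_t line = addr / LINE_SIZE;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;

    if (c->valid[set] && c->tag[set] == tag) {
        c->hits[var]++;
        return;
    }
    c->misses[var]++;
    if (c->valid[set])
        c->evicted_by[c->owner[set]][var]++;  /* attribute the replacement */
    c->valid[set] = 1;
    c->tag[set]   = tag;
    c->owner[set] = var;
}
```

After replaying a trace, `hits[v] / (hits[v] + misses[v])` gives the hit rate of variable `v`, and row `evicted_by[v]` gives its replacement causes; the diagonal entry `evicted_by[v][v]` corresponds to self interference.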
Reuse distance
number of distinct object references between two reuses
example: a d f e c e a c a a c e d e a c d f a e a c
- RD = 4 (d, f, e, c) between the first and second access to a
- RD = 4 (e, c, a, d) between the two accesses to f
- not a time but an address distance measure
- closely related to hit rate for LRU/FIFO caches
- leads to an effective and easy to apply optimisation
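
The definition can be applied directly by scanning backwards from an access until the previous access to the same object, counting distinct objects on the way. The sketch below is a simple quadratic version for traces written as strings of one-character object names; `reuse_distance` is a hypothetical helper, and practical tools typically use tree-based algorithms to handle long traces.

```c
#include <assert.h>
#include <stddef.h>

/* Reuse distance of the access at index pos in trace: the number of
   distinct objects referenced since the previous access to the same
   object. Returns -1 if the object has not been accessed before.
   The caller must ensure pos is a valid index into trace. */
int reuse_distance(const char *trace, size_t pos)
{
    assert(trace != NULL);
    int seen[256] = {0};
    int distinct = 0;
    unsigned char target = (unsigned char)trace[pos];

    for (size_t i = pos; i-- > 0; ) {
        unsigned char s = (unsigned char)trace[i];
        if (s == target)
            return distinct;        /* found the previous use */
        if (!seen[s]) {
            seen[s] = 1;
            distinct++;
        }
    }
    return -1;                      /* first access to this object */
}
```

For the trace `"adfeceacaacedeacdfaeac"` the second access to `a` (index 6) and the second access to `f` (index 17) both have a reuse distance of 4, matching the two examples on the slide.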
Clustering
finding groups of variables commonly used together
access sequence: a d f e c e a c a a c e d e a c d f a e a c
affinity matrix for a, b, c, d, e, f:
         b    c    d    e    f
    a         5    1    4    1
    b
    c              1    3
    d                   2    1
    e                        1
after merging a and c into the cluster a-c:
         b    d    e    f
    a-c       1    3.5  0.5
    b
    d              2    1
    e                   1
Results of the project
profiler implementation (as GCC module)
Benefits:
- good analysis capabilities and binding to sources
- good performance
- ease of use
Problems:
- ineffective (full code coverage)
- part of another program
Results of the project
applicability
- instrumentation works effectively for large-scale applications
- reasonable performance penalty
- platform/OS independent
Problems:
- lack of remote analysis
- GCC-centric
Results of the project
analysis tool
- visual diagrams
- cache analysis
- binding to source level
- flexible
Problems:
- poor representation for long-running large applications
- too few analysis tools
- some tests/tools get stuck on large dump files
Results of the project
profiling the profiler
Results of the project
a glance at the future
- consider DIOTA as instrumentation basis
- implement remote analysis
- multiple specific profilers within one analysis tool
- add support for HT/SMP architectures
That's all
Thank you!
Iliasov Alexey
Kyrgyz Russian Slavic University
Kyrgyzstan