Workshop in Nizhny Novgorod State University Activity Report


Workshop in Nizhny Novgorod State University: Activity Report
Alexey Iliasov (alexili@soros.kg), Kyrgyz Russian Slavic University

Goals of the project
Research:
- implementation approaches
- applicability
- real-life applications targeting
Implement:
- simple profiler
- analysis tool

Implementation Approaches: levels of abstraction
- hardware level
- machine instructions level
- assembly language level
- compiler level
- source code level
- library level

GNU Family Compilers
- supports many languages
- supports many targets
- provides many optimisation techniques
- open source
- available under the terms of the GPL

GNU Family Compilers
- machine independent: ports exist for more than 30 platforms
- high code generation quality: intensive optimisation
- RTL (Register Transfer Language)
- reusability: 225,000 lines of language- and platform-independent routines

GNU Family Compilers
- weird internal structure
- written in a mix of C and C++
- modularity problems
- lack of good documentation

GCC infrastructure
[Diagram: source, parser (language front-end), tree optimisation, RTL, 25 optimization passes + assembler generation (target back end), debug info, binary]

Mudflap: C/C++ bounds checker
- based on tree transformation
- instruments the program to detect memory access errors
- tracks calls to many library functions
- provides replacements for common C library functions

Mudzzi: memory profiler for GCC
- based on the mudflap approach
Development considerations:
- high performance
- language independent
- large-scale applications
- minimization of inlined code
- multi-threading support
- online or post-mortem analysis

Mudzzi: memory profiler for GCC
Tracked events:
- read/write memory accesses
- object declarations
- object destructions (for stack-frame objects)
- calls to malloc, calloc, realloc, mmap and free
- timing

Mudzzi: records format
Two record types:
- normal: prefix, record
- length prefixed: prefix, length, record
Memory Read/Write record:
- record type: 32 bits
- access address: 32 bits
- RDTSC CPU tick value: 64 bits
- source line number: 32 bits
- base pointer address: 32 bits
- size of accessed region: 32 bits
- coded source file and function name: 32 bits
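The Memory Read/Write record fields listed above add up to 256 bits, or 32 bytes per access. A minimal C sketch of such a layout follows, assuming the fields are stored in the listed order with no padding. The struct and field names are hypothetical and are not taken from Mudzzi.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: names are illustrative, only the field widths
 * come from the slide. */
typedef struct {
    uint32_t record_type;     /* record type: 32 bits */
    uint32_t access_addr;     /* access address: 32 bits */
    uint64_t tsc;             /* RDTSC CPU tick value: 64 bits */
    uint32_t source_line;     /* source line number: 32 bits */
    uint32_t base_addr;       /* base pointer address: 32 bits */
    uint32_t access_size;     /* size of accessed region: 32 bits */
    uint32_t coded_location;  /* coded source file and function name: 32 bits */
} mem_access_record;

/* The 64-bit field starts at offset 8, so no padding is inserted. */
static_assert(sizeof(mem_access_record) == 32,
              "record should occupy 256 bits");
```

At 32 bytes per recorded access, a dump holding one million accesses already takes about 32 MB under this layout.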

Mudzzi: code transformation

original:

    void foo() {
        int a = 3;
        int b[100];
        b[a] = 10;
        return;
    }

instrumented:

    void foo() {
        int a = 3;
        mpf_vardecl(&a, sizeof(int), 0, "a", .., ..);
        int b[100];
        mpf_vardecl(b, sizeof(int)*100, 0, "b", .., ..);
        b[a] = 10;
        mpf_add(b+a, a, b, 1, .., ..);
        mpf_varundecl(a, .., ..);
        mpf_varundecl(b, .., ..);
        return;
    }
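To make the idea of the transformation concrete, here is a small hand-instrumented sketch that logs declaration, access and destruction events into an in-memory buffer. The hook names (`prof_vardecl`, `prof_access`, `prof_varundecl`) and their signatures are hypothetical stand-ins; they are not the `mpf_*` functions, whose full argument lists are elided on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ev_kind { EV_DECL, EV_ACCESS, EV_UNDECL };

typedef struct {
    enum ev_kind kind;
    uintptr_t    addr;   /* address of the object or accessed element */
    size_t       size;   /* size in bytes (0 for EV_UNDECL) */
    const char  *name;   /* variable name, NULL when not applicable */
} event;

#define MAX_EVENTS 64
static event  log_buf[MAX_EVENTS];
static size_t log_len;

/* Append one event to the in-memory log. */
static void log_event(enum ev_kind kind, const void *p, size_t size,
                      const char *name) {
    assert(log_len < MAX_EVENTS);
    log_buf[log_len++] = (event){ kind, (uintptr_t)p, size, name };
}

/* Hypothetical profiler hooks (assumed signatures). */
static void prof_vardecl(const void *p, size_t size, const char *name) {
    log_event(EV_DECL, p, size, name);
}
static void prof_access(const void *p, size_t size) {
    log_event(EV_ACCESS, p, size, NULL);
}
static void prof_varundecl(const void *p) {
    log_event(EV_UNDECL, p, 0, NULL);
}

/* Hand-instrumented counterpart of the original foo(). */
void foo_instrumented(void) {
    int a = 3;
    prof_vardecl(&a, sizeof a, "a");
    int b[100];
    prof_vardecl(b, sizeof b, "b");
    b[a] = 10;
    prof_access(&b[a], sizeof b[a]);
    prof_varundecl(&a);
    prof_varundecl(b);
}
```

Calling `foo_instrumented()` once leaves five events in `log_buf`: two declarations, one access and two destructions. A real profiler would write such events to a dump file instead of a fixed-size array.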

Mudzzi: profiled code performance is ~20% of the original

Mudzzi: dump file size problem: the dump grows very fast

Visualization and analysis tool for memory profiler

Features overview
1. Visualization of the memory profiler dump
2. Cycle (loop) detection
3. Array access analysis inside detected cycles
4. Reuse distance calculation for arrays
5. Cache hit/miss rate, analysis and explanations

address/time diagram: example array access pattern
[Diagram: addresses plotted against time for array access by rows and by columns]

address/time diagram array access pattern
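The two access patterns in the diagram correspond to walking a two-dimensional array by rows and by columns. The sketch below computes the byte offset of the k-th access for each order, which is the quantity an address/time plot would show. The function names and the square n x n row-major layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the k-th access when an n x n row-major array with
 * elem-byte elements is walked row by row (inner loop over columns). */
size_t offset_by_rows(size_t k, size_t n, size_t elem) {
    assert(k < n * n);
    size_t i = k / n;   /* row index, advanced by the outer loop */
    size_t j = k % n;   /* column index, advanced by the inner loop */
    return (i * n + j) * elem;
}

/* Same array walked column by column (inner loop over rows). */
size_t offset_by_columns(size_t k, size_t n, size_t elem) {
    assert(k < n * n);
    size_t j = k / n;   /* column index, advanced by the outer loop */
    size_t i = k % n;   /* row index, advanced by the inner loop */
    return (i * n + j) * elem;
}
```

By rows, consecutive accesses are `elem` bytes apart and the plot is a single rising line. By columns, consecutive accesses are `n * elem` bytes apart, so the plot becomes a sawtooth that sweeps the whole array once per column.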

cache config and report

Blocked Matrix Multiply: cache interference

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* etype is the matrix element type; its definition is not shown on the slide. */
    void BlkMatrixMultiply (etype *X, etype *Y, etype *Z, int N, int B)
    {
        int w, q, i, j, k;
        etype r;

        for (w = 0; w < N; w += B)          /* blocks along k */
            for (q = 0; q < N; q += B)      /* blocks along j */
                for (i = 0; i < N; i++)
                    for (k = w; k < MIN (w + B, N); k++) {
                        r = *(X + i * N + k);
                        for (j = q; j < MIN (q + B, N); j++)
                            *(Z + i * N + j) += *(Y + k * N + j) * r;
                    }
    }

where N is the matrix size and B is the block size. We use N = 128, B = 32, and the arrays are a[N][N], b[N][N], c[N][N].

Blocked Matrix Multiply cache interference Full view of cache utilization report

Blocked Matrix Multiply: cache interference

VARIABLE `a' hit rate: 81%
Replacement causes:
  `b' (0xbffe78d0:49152) - 961 replacements (62%)
  `c' (0xbffdb8d0:49152) - 57 replacements (3%)
(callouts on the slide: interference with b, self interference)

VARIABLE `b' hit rate: 81%
Replacement causes:
  `a' (0xbfff38d0:49152) - 637 replacements (1%)
  `c' (0xbffdb8d0:49152) - 1607 replacements (3%)

VARIABLE `c' hit rate: 99%
Replacement causes:
  `a' (0xbfff38d0:49152) - 130 replacements (7%)
  `b' (0xbffe78d0:49152) - 1579 replacements (91%)
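One way to obtain a report of this kind is to replay the recorded addresses through a cache model and charge every eviction to the variable whose access caused it. The sketch below does this for a direct-mapped cache. The geometry (64 lines of 32 bytes), the type and function names, and the choice of a direct-mapped model are all assumptions; the slides do not describe the model the analysis tool actually uses.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u   /* assumed cache line size */
#define NUM_LINES  64u   /* assumed number of lines (direct-mapped) */
#define MAX_VARS   4

typedef struct {
    int      valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
    int      owner[NUM_LINES];                /* variable that filled the line */
    unsigned hits[MAX_VARS];
    unsigned misses[MAX_VARS];
    unsigned evicted_by[MAX_VARS][MAX_VARS];  /* [victim][cause] */
} cache_model;

/* Replay one access made through variable `var` to address `addr`.
 * Returns 1 on a hit and 0 on a miss. */
int cache_access(cache_model *c, int var, uint32_t addr) {
    assert(var >= 0 && var < MAX_VARS);
    uint32_t block = addr / LINE_BYTES;
    uint32_t set   = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;

    if (c->valid[set] && c->tag[set] == tag) {
        c->hits[var]++;
        return 1;
    }
    c->misses[var]++;
    if (c->valid[set])                        /* a live line is replaced */
        c->evicted_by[c->owner[set]][var]++;
    c->valid[set] = 1;
    c->tag[set]   = tag;
    c->owner[set] = var;
    return 0;
}
```

After the replay, `evicted_by[v][w]` with `v != w` counts interference of `w` with `v`, and `evicted_by[v][v]` counts self interference. The hit rate of `v` is `hits[v] / (hits[v] + misses[v])`.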

Reuse distance: number of distinct object references between two reuses

  a d f e c e a c a a c e d e a c d f a e a c

- between the first two references to a: RD = 4 (d, f, e, c)
- between the two references to f: RD = 4 (e, c, a, d)

Properties:
- not a time but an address distance measure
- closely related to hit rate for LRU/FIFO caches
- leads to an effective and easy to apply optimisation
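The definition above can be checked with a direct implementation: for a given reference, scan backwards to the previous reference to the same object and count the distinct objects seen on the way. This is a minimal sketch that represents each object as one character of a trace string; the function name and the trace encoding are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* trace holds n references, one byte per object.
 * Returns the number of distinct other objects referenced between
 * trace[pos] and the previous reference to the same object,
 * or -1 if trace[pos] is the first reference to that object. */
int reuse_distance(const char *trace, size_t n, size_t pos) {
    assert(pos < n);
    unsigned char seen[256] = {0};
    int distinct = 0;

    for (size_t j = pos; j-- > 0; ) {          /* j runs from pos-1 down to 0 */
        if (trace[j] == trace[pos])
            return distinct;
        unsigned char s = (unsigned char)trace[j];
        if (!seen[s]) {
            seen[s] = 1;
            distinct++;
        }
    }
    return -1;
}
```

With the slide's sequence written as the string `"adfeceacaacedeacdfaeac"`, position 6 (the second a) and position 17 (the second f) both give a reuse distance of 4, matching the two cases on the slide. The backward scan costs O(n) per query, which is fine for illustration but too slow for large dumps.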

Clustering: finding groups of variables commonly used together

  a d f e c e a c a a c e d e a c d f a e a c

[Figure: pairwise matrix over the variables a, b, c, d, e, f for this sequence, followed by a second matrix in which a and c appear merged into the cluster a-c]

Results of the project: profiler implementation (as GCC module)
Benefits:
- good analysis capabilities and binding to sources
- good performance
- ease of use
Problems:
- ineffective (full code coverage)
- part of another program

Results of the project: applicability
Benefits:
- instrumentation works effectively for large-scale applications
- reasonable performance penalty
- platform/OS independent
Problems:
- lack of remote analysis
- GCC-centric

Results of the project: analysis tool
Benefits:
- visual diagrams
- cache analysis
- binding to source level
- flexible
Problems:
- poor representation for long-running large applications
- too few analysis tools
- some tests/tools get stuck on large dump files

Results of the project: profiling the profiler

Results of the project: a glance at the future
- consider DIOTA as the instrumentation basis
- implement remote analysis
- multiple specific profilers within one analysis tool
- add support for HT/SMP architectures

That's all. Thank you!
Iliasov Alexey, Kyrgyz Russian Slavic University, Kyrgyzstan